Preprint
Article

This version is not peer-reviewed.

Modern Continual Learning with Foundation Models, Evaluation Challenges, and Future Directions

Submitted:

18 May 2026

Posted:

19 May 2026


Abstract
Continual learning (CL), also referred to as lifelong learning, aims to develop intelligent systems capable of learning continuously from sequential data while retaining previously acquired knowledge. As AI systems are increasingly deployed in dynamic real-world environments, CL has become essential for enabling long-term adaptation without catastrophic forgetting. This review provides a structured overview of major CL paradigms, including task-incremental, domain-incremental, class-incremental, online, multimodal, and federated CL. We examine the theoretical foundations of CL, particularly the stability-plasticity dilemma, catastrophic forgetting, transfer dynamics, and representation learning. In addition, we analyze major methodological categories, including regularization-based, replay-based, architecture-based, optimization-based, representation-learning, and parameter-efficient approaches. Recent developments involving transformers, prompt learning, foundation models, and multimodal adaptation are also discussed as emerging directions in modern CL research. Furthermore, this review highlights important issues related to benchmark fragmentation, evaluation inconsistency, memory constraints, computational efficiency, scalability, and privacy-aware learning. We also summarize key application domains, including computer vision, natural language processing, robotics, healthcare, and medical imaging. Finally, we identify open research challenges and future directions toward scalable, reliable, and deployment-oriented lifelong learning systems capable of operating effectively in continuously evolving environments.

1. Introduction

Artificial intelligence (AI) enables computational systems to perform tasks that typically require human intelligence, including reasoning, learning, perception, language understanding, and decision-making [1,2,3]. AI systems can analyze data, recognize patterns, generate language, and support decision-making across a wide range of domains [4,5,6]. However, many conventional machine learning (ML) models are trained on fixed datasets and deployed under the assumption that the data distribution will remain stable [7,8]. This assumption is often unrealistic in dynamic environments, where new classes may emerge, data distributions may shift, and previously learned knowledge may become insufficient or outdated [9].
Continual learning (CL), also referred to as incremental or lifelong learning, addresses this limitation by enabling models to learn from sequential data while retaining previously acquired knowledge [10,11]. This capability is essential for AI systems operating in real-world environments, where models must adapt to changing conditions, incorporate new tasks, and maintain reliable performance over time [12,13]. Without effective CL mechanisms, neural networks are vulnerable to catastrophic forgetting, in which learning new information causes performance degradation on earlier tasks [12,14]. Therefore, the central goal of CL is to balance plasticity, the ability to acquire new knowledge, with stability, the ability to preserve prior knowledge.
The need for CL has become more urgent with the rapid growth of continuous data streams from sensor networks, financial systems, social media platforms, Internet of Things devices, and other real-time sources [15,16,17,18]. Unlike traditional batch learning, these settings require models that can update incrementally as new data arrive, without complete retraining from scratch [11,19]. For example, an autonomous driving model trained under clear weather may need to adapt to rain, snow, or fog, while a fraud detection system must continuously respond to new fraudulent behaviors. CL provides a framework for such adaptation by allowing models to refine their parameters in response to evolving data distributions while preserving useful prior knowledge [12,20].
Beyond adaptability, CL also supports computational efficiency and sustainability. Retraining large models whenever new data become available is often costly, especially in resource-constrained environments such as mobile devices, embedded systems, robotics, and edge AI platforms [11]. By updating models incrementally, CL can reduce computational overhead and support long-term deployment in dynamic environments. These properties make CL relevant to a wide range of applications, including autonomous systems, healthcare, personalized recommendation, natural language processing, robotics, and adaptive decision-support systems.
Despite substantial progress, CL research remains fragmented. Existing studies differ in learning scenarios, task construction, evaluation protocols, memory budgets, and assumptions about task identity. Moreover, recent advances in transformers, prompt learning, parameter-efficient fine-tuning, foundation models, diffusion models, multimodal learning, and privacy-preserving learning have introduced new opportunities and challenges that are not fully synthesized in earlier surveys. This review therefore aims to provide an updated and structured overview of CL, with emphasis on modern methodological trends, evaluation inconsistencies, practical deployment constraints, and open research directions.
The main contributions of this review are as follows. First, it presents a structured organization of major CL settings, including task-incremental, domain-incremental, class-incremental, online, data-incremental, multimodal, and federated CL. Second, it summarizes major methodological categories, including regularization, replay, architecture-based, optimization-based, representation-learning, prompt-based, and parameter-efficient approaches. Third, it discusses evaluation protocols, benchmark fragmentation, memory constraints, and reproducibility issues that affect fair comparison across CL methods. Finally, it identifies emerging research directions related to foundation-model-based CL, multimodal continual adaptation, privacy-preserving CL, and real-world deployment.

2. Existing CL Surveys and Remaining Gaps

Although CL has been extensively reviewed in prior surveys, most existing works either focus on foundational concepts and traditional methodologies or specialize in narrow subdomains such as class-incremental learning, online CL, or biologically inspired approaches. Earlier surveys primarily emphasized classical paradigms including replay, regularization, and architecture-based methods, with limited discussion of recent developments driven by foundation models, vision transformers, prompt-based learning, and parameter-efficient adaptation techniques. Furthermore, many surveys provide descriptive summaries of methods without critically analyzing evaluation inconsistencies, benchmark fragmentation, memory-computation trade-offs, or the practical limitations affecting real-world deployment. Recent advances in multimodal learning, continual adaptation of large language models, diffusion-based continual generation, and privacy-aware or federated CL remain insufficiently synthesized in the literature. In addition, there is still a lack of unified discussion regarding reproducibility, task-split design, replay memory constraints, and standardized evaluation protocols, all of which significantly influence reported performance and fair comparison across methods. Motivated by these gaps, this review provides an updated and structured synthesis of modern CL research, with particular emphasis on emerging trends, evaluation challenges, scalable adaptation strategies, and open deployment issues in dynamic real-world environments.

CL

CL is an ML paradigm in which models learn continuously from a stream of data, integrating new information while retaining and reusing previously acquired knowledge [25], as shown in Figure 1A. Unlike traditional static learning methods that require retraining from scratch with all available data, CL enables systems to adapt incrementally to evolving tasks and data distributions. One of the primary challenges is catastrophic forgetting [26,27], where learning from new data typically leads to a significant decline in the model’s performance on previously learned information. This challenge reflects the inherent trade-off between learning plasticity and memory stability: too much plasticity can compromise memory retention, while excessive stability may hinder the integration of new knowledge. Rather than merely adjusting the balance between these two factors, an effective CL approach should also generalize well enough to handle variations both within and across tasks, as depicted in Figure 1B. This approach is inspired by the human ability to acquire knowledge progressively and apply it to diverse and dynamic contexts, making it essential for creating adaptive AI systems that operate effectively in real-world, ever-changing environments. In recent years, numerous CL strategies have been developed to address different challenges within ML.
These methods can be broadly categorized into five conceptual groups, as illustrated in Figure 1C: (1) regularization-based approaches, which introduce penalty terms based on the previous model; (2) replay-based approaches, which aim to approximate and reconstruct past data distributions; (3) optimization-based approaches, which directly adjust the optimization process; (4) representation-based approaches, which focus on learning stable and transferable feature representations; and (5) architecture-based approaches, which design adaptable model components tailored to specific tasks. This classification builds upon traditional taxonomies by incorporating recent advancements and offering more nuanced subcategories. We provide a detailed overview of how these techniques contribute to the goals of CL, analyzing both their theoretical underpinnings and practical implementations. Notably, these approaches often intersect (for instance, regularization and replay methods both influence the gradient direction during optimization), and they can complement each other, such as enhancing replay effectiveness through knowledge distillation (KD) from prior models. Real-world applications introduce unique challenges to CL, which can be broadly divided into scenario complexity and task-specific demands, as shown in Figure 1D. Regarding scenario complexity, the absence of task identity during training or testing and the limited availability of data, which sometimes arrive in small batches or even just once, pose significant hurdles. Additionally, due to the high cost and limited availability of labeled data, CL must perform well in few-shot, semi-supervised, and even unsupervised settings [28].
On the other hand, task specificity highlights that while most research has concentrated on visual classification, there is growing interest in other areas like object detection, semantic segmentation, conditional generation, reinforcement learning (RL), natural language processing (NLP), and ethical considerations, each with its own distinct challenges [13]. This review outlines these domain-specific issues and explores how CL approaches have been adapted to address them.
CL, while sharing some similarities with transfer learning, multi-task learning (MTL), and online learning, is fundamentally distinct in its objectives and methodology. Each of these paradigms addresses specific challenges in ML, but their differences lie in how they approach task sequencing, knowledge transfer, and adaptation to dynamic environments, as shown in Table 2.

CL vs. Transfer Learning

Transfer learning focuses on leveraging knowledge from a source task or domain to improve performance on a related target task or domain. Typically, this involves training a model on a large dataset (e.g., ImageNet) and fine-tuning it for a specific application [29]. The transfer process is usually one-directional and occurs only once. In contrast, CL emphasizes the ongoing acquisition of knowledge from a sequence of tasks [25], with the critical goal of retaining prior knowledge while learning new tasks. Unlike transfer learning, CL explicitly addresses the problem of catastrophic forgetting, ensuring that performance on earlier tasks is not degraded as new tasks are introduced. While both paradigms reuse knowledge, CL operates in a dynamic and evolving environment, whereas transfer learning assumes static datasets and a single knowledge transfer event.

CL vs. Multi-Task Learning

MTL focuses on learning multiple tasks simultaneously by optimizing a shared representation across all tasks [30]. This approach assumes that all tasks and their data are available during training, which allows the model to generalize effectively across tasks. In contrast, CL deals with tasks arriving sequentially, where access to prior task data may be limited or unavailable [25,31]. The primary challenge in CL is to integrate new knowledge without overwriting or forgetting previously learned tasks, while MTL does not encounter this issue since all tasks are trained together [32]. However, there is an overlap in the shared goal of improving performance across tasks, as CL can be viewed as a sequential extension of MTL in dynamic environments.

CL vs. Online Learning

Online learning involves updating a model incrementally as new data arrives, typically in a single task or stationary data distribution setting. It focuses on optimizing for real-time updates and minimizing latency, often without addressing how to handle changes in task or data distribution over time [33]. CL, on the other hand, is designed to operate in non-stationary environments where new tasks or concepts emerge sequentially. Unlike online learning, CL emphasizes the retention of knowledge across tasks and adapts to evolving distributions, ensuring that performance remains robust over time. While both paradigms involve incremental updates, CL prioritizes long-term adaptability and knowledge integration.
CL distinguishes itself by addressing the unique challenges of dynamic, real-world environments where tasks and data evolve over time. While it shares elements of knowledge transfer with transfer learning, task generalization with MTL, and incremental updates with online learning, its emphasis on lifelong learning without forgetting sets it apart. This makes CL a critical paradigm for building adaptive, resilient, and intelligent systems capable of operating effectively in ever-changing settings.
A fundamental challenge in CL is catastrophic forgetting as discussed in Section 4, a phenomenon where a model loses the ability to perform previously learned tasks when trained on new ones. This problem arises because traditional ML models are typically optimized for single-task scenarios, where the entire dataset is available at once. When these models are incrementally updated with new data, the weights and representations optimized for earlier tasks are overwritten, leading to a sharp decline in performance on prior tasks.
Addressing catastrophic forgetting is a core objective of CL. Strategies such as memory replay, parameter isolation, and regularization have been developed to mitigate this issue. These methods aim to preserve important parameters associated with past tasks, either by replaying prior data, selectively freezing certain parameters, or imposing constraints that minimize changes to previously learned representations. Despite significant progress, catastrophic forgetting remains a central obstacle in the development of robust CL systems, and overcoming it is crucial for advancing AI’s ability to function effectively in dynamic and evolving environments.
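The trade-off behind these strategies can be seen in a deliberately tiny example. The sketch below is not taken from any method cited in this review: it assumes a one-parameter linear model, two hypothetical conflicting tasks, and a plain quadratic penalty toward the earlier weights as a simplified stand-in for regularization-based approaches.

```python
# Toy illustration of forgetting in a one-parameter model y = w * x.
# Tasks, learning rate, and penalty strength are hypothetical choices.

def mse(w, data):
    """Mean squared error of y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, steps=200, lr=0.05, anchor=None, strength=0.0):
    """Gradient descent on the MSE. If `anchor` is given, add the gradient
    of strength * (w - anchor)**2, which discourages drift away from the
    weights learned on an earlier task."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        if anchor is not None:
            grad += 2 * strength * (w - anchor)
        w -= lr * grad
    return w

xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]    # task A: y = 2x
task_b = [(x, -1.0 * x) for x in xs]   # task B: y = -x (conflicts with A)

w_a = train(0.0, task_a)                               # learn task A first
w_naive = train(w_a, task_b)                           # then B, unprotected
w_reg = train(w_a, task_b, anchor=w_a, strength=1.0)   # then B, with penalty

loss_a_before = mse(w_a, task_a)     # task A loss right after learning A
loss_a_naive = mse(w_naive, task_a)  # task A loss after naive training on B
loss_a_reg = mse(w_reg, task_a)      # task A loss after penalized training
```

Under these assumptions, naive sequential training drives the task A loss from near zero to roughly 5.6, while the penalized run keeps it lower at the cost of a worse fit on task B. This is the stability-plasticity trade-off in miniature.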

Scope and Contribution of This Review

The primary objective of this review paper is to provide a comprehensive and systematic exploration of CL, a rapidly emerging paradigm in AI [34,35,36] that enables models to learn and adapt continuously over time. As the need for adaptive systems grows in dynamic and evolving real-world environments, this paper aims to elucidate the foundational aspects of CL while offering critical insights into its current state and future prospects. One of the key goals is to categorize and analyze the different types of CL approaches, including task-based, class-based, and domain-based learning. By delineating these types, the review aims to provide clarity on the diverse ways in which CL addresses various challenges in dynamic data scenarios. Additionally, the paper seeks to delve into the theoretical underpinnings of CL, drawing connections to biological inspirations, mathematical frameworks, and its interplay with related fields such as multi-task and transfer learning.
Another important objective is to evaluate and synthesize the state-of-the-art methods in CL. This includes examining strategies such as memory-based replay, regularization techniques, parameter isolation, and hybrid approaches. Each method will be analyzed for its strengths, limitations, and suitability for specific types of tasks and data distributions. Alongside this, the review will identify and elaborate on key challenges faced by the field, particularly catastrophic forgetting, scalability, resource constraints, and the lack of standardized evaluation metrics. Furthermore, this paper will highlight the diverse applications of CL in areas such as robotics, healthcare, autonomous systems, and personalized recommendations. These examples will illustrate the transformative potential of CL in creating intelligent systems capable of adapting to changing conditions. Finally, the review will propose future directions for research, emphasizing opportunities to address existing limitations, integrate CL with complementary paradigms, and explore ethical and practical considerations in deploying such systems. By achieving these objectives, this paper aspires to serve as a comprehensive resource for researchers, practitioners, and policymakers, fostering innovation and collaboration in the advancement of CL.
As depicted in Table 1, while some earlier surveys provide broad overviews of CL [37,38], they do not reflect the significant advancements made in recent years. Conversely, more recent surveys often focus on specific aspects of CL, such as its biological foundations [39,40,41], tailored approaches for visual classification [42,43,44], or particular applications in NLP [45,46] and RL [47]. This paper provides an updated synthesis focused on modern CL trends, evaluation inconsistencies, foundation models, and open deployment challenges. Building on this foundation, we offer a detailed exploration of the field, discussing emerging trends, interdisciplinary prospects, methodologies, and open challenges.

2.1. Setup

In this section, we begin by outlining the fundamental framework of CL. Following that, we discuss common scenarios and other emerging paradigms in CL.

Basic Formulation of CL

CL focuses on adapting to evolving data distributions, where training samples from different distributions are introduced sequentially. A CL model, parameterized by $\theta$, must effectively learn the given task(s) with limited or no access to prior training samples, while maintaining strong performance on their corresponding test sets. Formally, a batch of training samples for a task $t$ is represented as $\mathcal{D}_{t,b} = \{\mathcal{X}_{t,b}, \mathcal{Y}_{t,b}\}$, where $\mathcal{X}_{t,b}$ is the input data, $\mathcal{Y}_{t,b}$ is the data label, $t \in \mathcal{T} = \{1, \dots, k\}$ is the task identity, and $b \in \mathcal{B}_t$ is the batch index ($\mathcal{T}$ and $\mathcal{B}_t$ denote their respective spaces). Here we define a task by its training samples $\mathcal{D}_t$ following the distribution $\mathbb{D}_t := p(\mathcal{X}_t, \mathcal{Y}_t)$ ($\mathcal{D}_t$ denotes the entire training set by omitting the batch index, likewise for $\mathcal{X}_t$ and $\mathcal{Y}_t$), and assume that there is no difference in distribution between training and testing. Under realistic constraints, the data label $\mathcal{Y}_t$ and the task identity $t$ might not always be available. In CL, the training samples of each task can arrive incrementally in batches (i.e., $\{\{\mathcal{D}_{t,b}\}_{b \in \mathcal{B}_t}\}_{t \in \mathcal{T}}$).
If not explicitly stated, it is generally assumed that each task has enough labeled training data available, which aligns with the principles of supervised CL. Based on the given $\mathcal{X}_t$ and $\mathcal{Y}_t$ in each $\mathcal{D}_t$, CL has been expanded to various scenarios, including zero-shot learning [48,49], few-shot learning [50], semi-supervised learning [51], open-world learning (involving the identification of unknown classes followed by label incorporation) [52,53], and unsupervised or self-supervised learning [54,55].
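The formulation above can be read as a nested stream: an outer loop over task identities t and an inner loop over batch indices b. The following is a minimal sketch of such a stream. The generator name `task_stream`, the synthetic Gaussian inputs, and the two-classes-per-task labelling are all hypothetical choices made for illustration.

```python
import random
from typing import Iterator, List, Tuple

# One stream element: (task id t, batch index b, inputs X_{t,b}, labels Y_{t,b})
Batch = Tuple[int, int, List[float], List[int]]

def task_stream(num_tasks: int, batches_per_task: int, batch_size: int,
                seed: int = 0) -> Iterator[Batch]:
    """Yield batches task by task; earlier batches are never yielded again."""
    rng = random.Random(seed)
    for t in range(1, num_tasks + 1):
        for b in range(1, batches_per_task + 1):
            # Hypothetical data: inputs centred on a task-dependent mean,
            # labels drawn from two classes owned by task t.
            xs = [rng.gauss(float(t), 1.0) for _ in range(batch_size)]
            ys = [2 * (t - 1) + rng.randint(0, 1) for _ in range(batch_size)]
            yield t, b, xs, ys

order = []
labels_by_task = {}
for t, b, xs, ys in task_stream(num_tasks=3, batches_per_task=2, batch_size=4):
    # A learner would update its parameters here using only (xs, ys),
    # since samples from earlier batches are no longer available.
    order.append((t, b))
    labels_by_task.setdefault(t, set()).update(ys)
```

Because the generator never revisits a batch, any access to past data (for example, a replay memory) has to be added explicitly by the learner.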

3. Types of CL

CL encompasses various scenarios, as shown in Table 3, each tailored to specific types of data streams, tasks, and requirements. These scenarios, or settings, define how models interact with new information and retain prior knowledge. The three primary types are task-incremental, domain-incremental, and class-incremental learning. Figure 2 provides a detailed breakdown of the different categories.
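As a compact reference, the three primary scenarios can be contrasted as follows. Here $\mathcal{X}_i$ and $\mathcal{Y}_i$ denote the input and label space of task $i$, and $t$ is the task identity. The conditions are a simplified reading of the descriptions in this section, not a formal definition given by the review. In particular, treating the label spaces in task-incremental learning as disjoint is an assumption.

```latex
\begin{aligned}
\text{Task-incremental (TIL):}\quad
  & \mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset \ (i \neq j),
  && t \text{ known during training and inference} \\
\text{Domain-incremental (DIL):}\quad
  & \mathcal{Y}_i = \mathcal{Y}_j,\ \ p(\mathcal{X}_i) \neq p(\mathcal{X}_j) \ (i \neq j),
  && t \text{ typically unavailable during inference} \\
\text{Class-incremental (CIL):}\quad
  & \mathcal{Y}_i \cap \mathcal{Y}_j = \emptyset \ (i \neq j),
  && t \text{ not provided during inference}
\end{aligned}
```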

Task-Incremental Learning

Task-incremental learning (TIL) is a paradigm within CL in which models are trained sequentially on a series of distinct tasks [56,57,58]. A defining characteristic of TIL is that the identity of the task is explicitly provided during both training and inference, enabling the model to tailor its predictions based on the known task. This approach simplifies the problem of learning incrementally by focusing on task-specific outputs while preserving performance across previously encountered tasks. Table 4 summarizes TIL.
In TIL, each task is typically associated with a unique output head or module within the model. For example, in a classification scenario, the model might have separate output layers for each task, and the task identity directs the model to use the corresponding output layer during inference. This separation helps mitigate catastrophic forgetting, as the updates to the model parameters for one task are less likely to interfere with those of other tasks. However, this approach assumes that task boundaries are clearly defined, and the system knows which task it is addressing at any given time.
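The multi-head arrangement described above can be sketched in a few lines. This is an illustrative toy rather than an implementation from the cited work: the class name `MultiHeadModel`, the identity "backbone", and the hand-picked linear head weights are all hypothetical.

```python
from typing import Dict, List

def dot(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))

class MultiHeadModel:
    """Shared feature extractor plus one linear output head per task."""

    def __init__(self) -> None:
        # task id -> list of per-class weight vectors
        self.heads: Dict[str, List[List[float]]] = {}

    def add_head(self, task_id: str, weights: List[List[float]]) -> None:
        # Registering a new head leaves all existing heads untouched.
        self.heads[task_id] = weights

    def features(self, x: List[float]) -> List[float]:
        # Placeholder for a shared backbone (identity, for illustration only).
        return x

    def predict(self, x: List[float], task_id: str) -> int:
        # The task identity selects which output head is evaluated.
        h = self.features(x)
        logits = [dot(w, h) for w in self.heads[task_id]]
        return max(range(len(logits)), key=lambda c: logits[c])

model = MultiHeadModel()
model.add_head("task_a", [[1.0, 0.0], [0.0, 1.0]])               # 2 classes
x = [2.0, 1.0]
pred_before = model.predict(x, "task_a")
model.add_head("task_b", [[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])  # 3 classes
pred_after = model.predict(x, "task_a")
pred_task_b = model.predict(x, "task_b")
```

In this toy, adding the head for `task_b` cannot change predictions for `task_a` because the backbone is fixed. In a trained network the shared backbone is usually updated as well, so head separation reduces interference but does not eliminate it.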
One of the primary benefits of TIL is its robustness to forgetting. Since the model is explicitly informed about the task identity, it does not need to generalize across tasks. This allows the use of task-specific components, such as output heads, which can independently optimize for the unique characteristics of each task [59]. Additionally, the modular nature of TIL enables efficient learning and evaluation of tasks without requiring retraining on the entire dataset. TIL is particularly advantageous in scenarios where task boundaries are clear and well-defined, such as robotics, where different tasks like object manipulation, navigation, and gesture recognition may need to be learned sequentially. By isolating these tasks, the model can adapt to new functionalities without compromising existing ones.

Applications of Task-Incremental Learning

TIL is commonly applied in domains where tasks are naturally discrete and identifiable. For example, in healthcare, models trained to analyze different types of medical imaging data (e.g., X-rays, MRIs, CT scans) can use TIL to maintain task-specific expertise [60]. Similarly, robots can learn individual tasks, such as grasping objects, path planning, and object categorization, without interference between tasks, and language models can incrementally learn new languages or domains while retaining the ability to process previously learned ones. By providing explicit task identities, TIL offers a structured approach to CL, making it a suitable choice for scenarios with clear task demarcation. However, as the need for more flexible and generalized models grows, hybrid approaches that incorporate elements of TIL while addressing its limitations are gaining traction.

Examples of Task Incremental Learning

1. Sequential Image Classification Across Datasets: A common example of TIL involves training a model to classify data from different image datasets sequentially [59]. For instance, a model could first learn to classify handwritten digits in the MNIST dataset. Once this task is complete, the model may be trained to classify natural images in the CIFAR-10 dataset. In this scenario, TIL ensures that the model retains its ability to classify handwritten digits while learning to classify natural images. During inference, the task identity (e.g., "MNIST" or "CIFAR-10") directs the model to use the relevant task-specific output head, ensuring accurate predictions for each dataset. This setup is particularly useful in academic and industrial settings where models need to be updated with new data or tasks without retraining from scratch.
2. Robotics: Learning Discrete Tasks: In robotics, TIL can be applied to teach a robot distinct skills sequentially [61]. For example, a robot may first learn to recognize and manipulate specific objects, followed by learning to navigate different environments. Each task is distinct, with its own set of parameters and objectives. The task identity ensures that the robot applies the appropriate skill set during operation, avoiding confusion between object manipulation and navigation.
3. Healthcare: Multi-Modal Diagnosis Systems: TIL is particularly beneficial in healthcare applications where systems need to process distinct datasets for different diagnostic tasks [60]. For instance, a diagnostic model could sequentially learn to analyze X-ray images for lung diseases, MRI scans for brain abnormalities, and CT scans for cardiovascular conditions. With TIL, the model can retain its diagnostic capabilities for earlier tasks while learning new ones, ensuring that expertise in analyzing X-rays is not lost when training on MRI or CT data.
In summary, TIL treats each task independently, which reduces interference and preserves performance on previous tasks, and it allows parameters to be optimized for each task individually. TIL is therefore well suited to domains with clearly defined, non-overlapping tasks, where systems can learn new tasks incrementally while retaining accuracy on earlier ones. However, the reliance on task identity highlights one of the key limitations of TIL: it is not suited for environments where tasks are ambiguous or where the task identity is unavailable during inference. Addressing these limitations requires hybrid or alternative approaches in CL.

Challenges

Despite its advantages, TIL has certain limitations. The reliance on task identity during inference limits its applicability in scenarios where the task is unknown or ambiguous. For instance, in real-world environments where tasks may blend or overlap, the assumption of clear task boundaries may not hold. Furthermore, as the number of tasks increases, the need for task-specific components can lead to scalability issues, including memory and computational overhead. Another challenge is the lack of knowledge transfer between tasks. Since TIL separates tasks into distinct components, it often misses opportunities to share learned representations, which could enhance performance on new tasks. Research efforts are exploring ways to balance task isolation with shared representations to overcome this limitation [60].
The primary challenge in TIL is not merely to prevent catastrophic forgetting but to develop effective methods for sharing learned representations across tasks. This includes optimizing the balance between performance and computational efficiency, as well as leveraging knowledge from one task to enhance performance on others, that is, achieving positive forward or even backward transfer between tasks [62,63]. These remain unresolved challenges. Real-world examples of TIL include learning to play various sports or musical instruments, where it is generally clear which specific sport or instrument is being practiced.
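Backward transfer is often quantified from a matrix of accuracies recorded after each training stage. The sketch below uses one commonly used definition, which may differ from the one adopted in the works cited above. The function names and the accuracy values are hypothetical and purely illustrative.

```python
from typing import List

def backward_transfer(acc: List[List[float]]) -> float:
    """acc[i][j] = accuracy on task j after training on tasks 0..i.

    Average, over all tasks except the last, of
    (final accuracy on the task) - (accuracy right after learning it).
    Negative values indicate forgetting; positive values indicate that
    later learning improved earlier tasks.
    """
    T = len(acc)
    return sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)

def average_final_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after the final training stage."""
    return sum(acc[-1]) / len(acc[-1])

# Hypothetical 3-task accuracy matrix (rows: training stage, columns: task).
acc = [
    [0.9, 0.1, 0.1],
    [0.7, 0.8, 0.1],
    [0.6, 0.7, 0.9],
]
bwt = backward_transfer(acc)           # ((0.6 - 0.9) + (0.7 - 0.8)) / 2
final_acc = average_final_accuracy(acc)
```

With these illustrative numbers the backward transfer is negative (about -0.2), meaning that accuracy on the first two tasks dropped after later training.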

Domain-Incremental Learning

Domain-Incremental Learning (DIL) is a CL paradigm designed to handle scenarios where the model encounters new data distributions from different domains over time. Unlike TIL, the underlying task in DIL remains consistent, but the data characteristics, such as the input distribution, environment, or context, change [64,65]. This requires the model to adapt to new domains without forgetting the knowledge learned from previous domains, making it particularly suitable for real-world applications where environmental or contextual variations are frequent. In DIL, the goal is to achieve domain adaptation while maintaining performance on previously seen domains. For example, a model trained to recognize objects may initially be trained on images captured in sunny weather and later exposed to images taken in rainy or snowy conditions. The task of object recognition remains unchanged, but the input data’s domain shifts due to variations in lighting, background, or environmental factors. Table 5 presents the overview of DIL.
Unlike TIL, the task identity is typically unavailable during inference in DIL. This means that the model must generalize across domains without explicit guidance about which domain the input belongs to. Achieving this requires robust learning strategies that can minimize domain-specific biases while preserving domain-agnostic features [66,67,68]. DIL is particularly challenging due to catastrophic forgetting, as updates to accommodate new domains can overwrite knowledge about previous domains. In DIL, preventing forgetting “by design” is not feasible, making the mitigation of catastrophic forgetting a critical and unresolved challenge. Examples of this scenario include incrementally learning to recognize objects under varying lighting conditions [69] (e.g., indoors versus outdoors) or adapting to driving in different weather conditions [65]. Strategies to address this issue often involve regularization techniques, memory-based approaches, or domain-specific adaptation mechanisms.

Examples of Domain-Incremental Learning

1. Object recognition across weather conditions: One of the most illustrative examples of DIL is in object recognition systems deployed in autonomous vehicles. A model trained to recognize traffic signs in clear weather may later need to adapt to recognizing the same signs in foggy, rainy, or snowy conditions. While the core task of traffic sign recognition remains unchanged, the model must accommodate shifts in the data distribution caused by weather changes. DIL enables the model to adapt to these conditions while retaining its ability to perform well in previously encountered weather scenarios.
2. Medical imaging (cross-domain adaptation): In healthcare, DIL can be applied to diagnostic systems that encounter data from different imaging devices or medical institutions [70]. For instance, a model trained to detect tumors in CT scans from one hospital may need to analyze scans from another hospital, where variations in imaging protocols, equipment, and patient demographics create domain shifts. Here, DIL ensures that the model adapts to the new domain while preserving its diagnostic capabilities for data from the original domain.
3. NLP (sentiment analysis across domains): In NLP, DIL can address challenges such as performing sentiment analysis on text from different sources [71]. For example, a model trained on movie reviews may need to adapt to analyzing customer reviews for products or services. While the task (sentiment classification) remains the same, the language style, vocabulary, and context vary across domains. DIL allows the model to generalize across these domains without sacrificing performance on the original dataset.
The main advantages in these examples are:
1. Consistent task objective: The core task remains the same, simplifying the learning objective compared to TIL.
2. Robustness to domain shifts: DIL enables models to handle variations in data distributions effectively, making them suitable for dynamic environments.
3. Scalability: By focusing on domain adaptation, DIL can be scaled to handle diverse contexts within a single task framework.

Challenges

While DIL offers a structured approach to handling domain shifts, it faces significant challenges, such as balancing domain-specific adaptation with generalization and ensuring efficiency in memory and computational resources. Future research may focus on hybrid methods that combine domain adaptation with task transfer, as well as techniques to reduce catastrophic forgetting without excessive reliance on domain-specific components. By applying DIL, systems can achieve greater flexibility and adaptability, enabling them to operate effectively in diverse and evolving environments while retaining knowledge from past domains.

Class-Incremental Learning

Class-incremental learning (CIL) is a demanding paradigm in CL where a model must sequentially learn new classes while retaining knowledge of previously learned ones [50,72]. Unlike TIL, the task identity is not provided during inference, making CIL particularly challenging as the model must classify inputs across all learned classes without prior information about the context or task [67,73,74]. This setting closely mirrors real-world scenarios, where systems are often required to expand their knowledge incrementally without forgetting prior capabilities.
Table 6 presents the overview of CIL. In CIL, the goal is to extend the model’s classification abilities incrementally by introducing new object categories or class labels over time [75]. For example, a model trained to recognize animals in a dataset containing cats and dogs may later need to classify additional animals, such as horses and birds. Unlike other CL paradigms, the model is evaluated across all classes (e.g., cats, dogs, horses, birds) simultaneously, requiring it to integrate new knowledge without degrading its performance on earlier classes. The lack of task identity during inference adds complexity, as the model must generalize across a growing set of classes without explicit guidance. This necessitates robust strategies to address catastrophic forgetting, where new learning overwrites the knowledge of earlier classes, and class imbalance, as new classes are often presented in smaller batches than the original dataset [76,77].
To mitigate these challenges, various techniques are employed, such as memory-based replay (storing examples from earlier classes) [78], KD (transferring knowledge from previous versions of the model) [79], and dynamic architectures (expanding model capacity as new classes are introduced) [59]. However, achieving high performance in CIL remains an open research problem due to the trade-offs between memory usage, computational cost, and knowledge retention.

Examples of Class Incremental Learning

1. Extending Image Classification Models: A classic application of CIL is extending image classifiers with new object categories. For instance, a model initially trained to recognize household items like chairs and tables may later need to classify electronic gadgets such as laptops and smartphones. In this scenario, the model must integrate the new categories into its knowledge base while maintaining its ability to correctly classify the earlier categories. CIL enables this incremental expansion without requiring access to the entire dataset of past classes, which may be impractical due to storage or privacy constraints. In CL, particularly CIL for visual classification tasks, state-of-the-art methods primarily concentrate on image classification, often utilizing complex and large-scale datasets like ILSVRC2012 [80] and its variations. Additionally, numerous benchmarks exist for video classification [69,81,82], differing in scale and objectives.
2. Autonomous driving, incremental object detection: In autonomous vehicles, systems must continually learn to recognize new objects, such as novel road signs or vehicles, as they are introduced in different regions or by new regulations. For example, a car deployed in one country might later be used in another with entirely different road sign categories. CIL ensures that the system adapts to the new categories while retaining its ability to detect and classify previously learned objects.
As one of the earlier works, ILOD [83] introduced response distillation for old classes to prevent catastrophic forgetting in Fast R-CNN [84]. Building on this, RKT [85] further refined the approach by distilling co-occurrence relationships from selected proposals. KD was subsequently applied to other object detectors, including SID [86] on CenterNet [87], RILOD [88] on RetinaNet [89], ERD [90] on GFLV1 [91], CIFRCN [92], Faster ILOD [93], DMC [94], BNC [95], and IOD-ML [96] on Faster R-CNN [97], among others.
Some methods leverage unlabeled, in-the-wild data to integrate old and new models into a unified framework, addressing challenges such as non-co-occurrence (BNC [95]) and improving the stability-plasticity trade-off (DMC [94]). To minimize the adverse effects of KD on learning plasticity, IOD-ML [96] employs meta-learning to adjust parameter gradients, achieving a balance between old and new classes.
Incremental object detection (IOD) is applicable beyond 2D images, extending to 3D images [98] and videos [99]. Additional related scenarios include incremental few-shot detection [100], where a pre-trained object detector incorporates new classes with minimal annotated data, and open-world object detection [52], where the detector identifies unknown object instances and registers them upon receiving annotations.
3. Healthcare, incremental diagnosis systems: In medical imaging, diagnostic models often need to be updated as new diseases or imaging modalities are introduced. For instance, a model trained to detect common skin conditions might need to learn to diagnose rare disorders over time. Using CIL, the model can integrate these new categories without losing its diagnostic accuracy for previously learned conditions [101].

Challenges in Class Incremental Learning

Catastrophic Forgetting: Without access to prior task data, models are prone to forgetting earlier classes as new ones are introduced.
Scalability: As the number of classes grows, managing model capacity and computational efficiency becomes increasingly complex.
Class imbalance: New classes are often introduced with fewer examples, leading to an imbalance that can skew model performance.

3.1. Data-Incremental Learning

Data-incremental learning is a CL scenario where new data instances arrive over time, potentially from existing or new classes, without explicit task boundaries. Unlike CIL, where new classes are introduced sequentially and the model’s output space expands accordingly, data-incremental learning focuses on the ongoing arrival of data, sometimes from previously seen classes and sometimes from novel ones, mirroring the unpredictable and nonstationary nature of real-world data streams. Table 7 gives an overview of data-incremental learning.
In data-incremental learning, the model must adapt continuously, updating its knowledge as each new data chunk or instance is observed, while maintaining performance on previously encountered information. There are no clear demarcations between tasks or phases; instead, the learning process is fluid, with the model encountering a mix of familiar and unfamiliar data points at any time. This setting is particularly relevant for applications such as autonomous vehicles, online recommendation systems, and adaptive control systems, where data flows in constantly and the system must respond in real time to both recurring and novel patterns.
A key challenge in data-incremental learning is ensuring that the model does not forget previously learned information (catastrophic forgetting) while efficiently integrating new data, especially when the underlying data distribution changes or when new classes appear unexpectedly. The absence of explicit task boundaries and the mixture of old and new data make data-incremental learning a demanding but highly practical and realistic setting for CL research and applications. The key features of data-incremental learning are:
  • Incremental data arrival: The model receives data sequentially, one instance or batch at a time, without knowledge of whether the data introduces new classes or extends existing ones.
  • No explicit task boundaries: Unlike task-based scenarios, data-incremental learning does not provide information about task transitions, requiring the model to infer patterns and adjust its learning dynamically.
  • Challenges of catastrophic forgetting: As new data arrives, the model’s parameters may be updated in ways that overwrite knowledge of previously learned classes, leading to catastrophic forgetting.
  • Adaptation and Generalization: The model must generalize well to new instances and classes while preserving accuracy on old ones, requiring a balance between plasticity (learning new data) and stability (retaining old knowledge).

Examples of Data-Incremental Learning

1. Dynamic Object Classification: In real-world object recognition systems, such as those used in smart home devices, data from cameras and sensors often arrives incrementally. For example, a system might first be trained to recognize common household objects like chairs and tables but later encounter new objects like plants or appliances. The model must adapt to recognize these new categories without losing its ability to classify previously learned objects.
2. Evolving customer preferences in recommendation systems: Recommendation systems often deal with continuously changing user preferences and new items being added to the catalog. For instance, a music recommendation system may encounter new genres or artists over time while users’ preferences also evolve. The system must integrate these new data points dynamically to provide accurate recommendations without retraining from scratch.
3. Healthcare, continuous data from medical devices: In healthcare, wearable devices and monitoring systems generate continuous streams of data. For instance, a model might initially learn to detect anomalies in heart rate data but later need to incorporate additional signals such as oxygen levels or blood pressure. Data-incremental learning enables the system to integrate this evolving information while preserving its diagnostic accuracy for previously monitored parameters.

3.1.1. Challenges in Data-Incremental Learning

  • Unstructured data streams: The lack of clear task boundaries increases the difficulty of organizing and processing data effectively.
  • Memory constraints: Retaining past data or features for replay becomes resource-intensive as the volume of data grows.
  • Class imbalance: Incrementally arriving data may introduce imbalanced class distributions, skewing the model’s performance.
To handle these challenges, several techniques are employed, including:
  • Replay mechanisms: Retaining a subset of past data or using generative models to recreate previous data for rehearsal as discussed in Section 6 in detail.
  • Dynamic networks: Expanding model capacity incrementally to accommodate new data without overwriting existing knowledge.
  • Regularization methods: Penalizing changes to parameters critical for previously learned data to mitigate forgetting as discussed in Section 6.
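The replay mechanism listed above can be sketched for a stream without task boundaries. The following is a minimal illustration, assuming reservoir sampling as the selection rule (one common choice, not one prescribed by this review); the class name `ReservoirBuffer`, the capacity, and the synthetic stream are hypothetical.

```python
import random


class ReservoirBuffer:
    """Fixed-size replay memory filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []        # stored (x, y) pairs
        self.num_seen = 0      # total number of stream items observed so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / num_seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.num_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, batch_size):
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)


# Hypothetical stream: the label changes mid-stream without any task signal.
buffer = ReservoirBuffer(capacity=50)
for step in range(1000):
    label = 0 if step < 500 else 1
    buffer.add((float(step), label))

replay_batch = buffer.sample(8)   # would be mixed with incoming data during training
```

Because every observed item has the same chance of being in memory, the buffer needs no knowledge of task transitions, which is why this rule is often paired with boundary-free settings. Memory cost stays fixed at `capacity` regardless of stream length.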

Other Emerging Paradigms in CL

While traditional paradigms like task-incremental, class-incremental, and domain-incremental learning dominate the field, emerging paradigms are gaining traction as researchers address increasingly complex and nuanced challenges [102,103,104]. These paradigms include few-shot CL, unsupervised CL, and meta-CL, among others. Each represents an approach tailored to specific needs or constraints in dynamic learning environments. Table 8 presents an overview of emerging paradigms in CL.

Few-Shot CL

Few-shot CL combines the principles of CL and few-shot learning [50]. It focuses on enabling models to learn new tasks or classes with very limited labeled data while retaining knowledge of previously learned tasks. This paradigm is particularly useful in scenarios where collecting extensive labeled datasets is impractical or costly, such as in medical diagnostics or rare object recognition [105,106,107]. The main challenge is adapting effectively from minimal data while avoiding catastrophic forgetting. To address this, meta-learning, episodic memory, and generative replay are commonly employed to improve the model’s ability to generalize from a few examples.

Unsupervised CL

Unsupervised CL eliminates the need for labeled data during training, focusing instead on discovering patterns and structure within data streams [54,108]. This paradigm is motivated by the vast amounts of unlabeled data generated in real-world environments, such as video surveillance, social media feeds, or IoT sensors [55,109]. However, without explicit labels, models must balance representation learning for new data while maintaining consistency with previously learned patterns. Therefore, self-supervised learning, clustering-based methods, and contrastive learning approaches are often used to extract meaningful features from unlabeled data.

Meta-CL

Meta-CL involves training models to adapt quickly to new tasks in a CL setting [110]. This paradigm leverages meta-learning principles to prepare models for future tasks by training them on a sequence of tasks, encouraging rapid adaptation and minimizing forgetting. It aims to combine the benefits of CL with the adaptability of meta-learning. The main challenge is designing training algorithms that balance fast adaptation with stability across tasks. Gradient-based meta-learning and memory-augmented neural networks are frequently used to enhance task adaptability [111,112].
Javed and White [113] present an Online-aware Meta-Learning (OML) training strategy that processes sequential inputs online while minimizing interference, naturally producing sparse representations well-suited for CL. A Neuromodulated Meta-Learning algorithm (ANML) [114] builds on this concept by incorporating a meta-learned, context-dependent gating function to selectively activate neurons based on incremental tasks. Attentive Independent Mechanisms (AIM) [115] further refines this approach by using a mixture of experts to make predictions with representations learned by OML [113] or ANML [114], achieving greater sparsity at the architectural level.
Meta-learning can also complement experience replay, enhancing the utilization of both old and new training samples. For instance, Meta-Experience Replay [110] aligns their gradient directions, while incremental task-agnostic meta-learning [116] uses a meta-updating rule to balance these gradients. Look-Ahead MAML [117] combines experience replay with an online optimization of OML [113] objectives, incorporating an adaptively modulated learning rate. OSAKA [118] introduces a hybrid objective focused on knowledge accumulation and rapid adaptation, achieved by meta-training for a robust initialization and integrating incremental task knowledge into it.
Meta-learning also facilitates the optimization of specialized architectures. MERLIN [119] learns a meta-distribution of model parameters for each task’s representations, enabling task-specific model sampling and ensemble inference. Similarly, Henning and Cervera [120] employ a Bayesian approach to learn task-specific posteriors from a shared meta-model. MetA Reusable Knowledge (MARK) [121] maintains shared weights that are incrementally updated through meta-learning and selectively masked for task-specific applications. Anti-Retroactive Interference for lifelong learning (ARI) [122] integrates adversarial attacks with experience replay to generate task-specific models, which are then merged using meta-training.

Federated CL

In federated CL, models are trained across distributed devices or nodes, with each node continually receiving new data [123]. This paradigm is particularly relevant for privacy-preserving applications, such as personalized healthcare or mobile device personalization. The main challenges are sharing knowledge across nodes without violating privacy, handling heterogeneous data distributions, and avoiding forgetting across nodes. Key techniques include decentralized learning algorithms, secure aggregation, and adaptive synchronization protocols.

Multi-Agent CL

Multi-agent CL explores scenarios where multiple models or agents learn and interact in a shared environment, continually adapting to new tasks or domains. This paradigm is particularly useful for collaborative robotics, multi-player gaming AI, and distributed sensor networks.
Challenges: Coordinating knowledge transfer between agents and managing inter-agent dependencies [124].
Key techniques: Communication protocols, shared memory systems, and ensemble learning approaches.
While addressing catastrophic forgetting in agents, it is also essential to reduce interference from other agents, leverage their knowledge effectively, preserve client privacy, and limit the accumulation and spread of errors. Additionally, CL algorithms should be readily adaptable to multi-agent systems. Given that multi-agent learning introduces additional computational overhead, particularly in terms of communication, and is often deployed on edge devices, CL methods should aim to be as computationally efficient as possible. For instance, Yoon et al. [124] propose a federated CL framework aimed at reducing both inter-client interference and communication overhead. In typical federated learning setups, a central server aggregates updates from multiple clients and distributes a global model. However, merging knowledge from clients trained on different data distributions can lead to catastrophic forgetting. To address this, the authors divide the model parameters into three categories: (1) a global dense base parameter set that captures shared knowledge across all clients, (2) a local base parameter set for client-specific general knowledge, and (3) a sparse task-adaptive parameter set tailored to each client’s current task. Clients selectively activate the task-adaptive parameters using attention masks.
In a related approach, an enhanced version of Learning without Forgetting is employed to retain learned information locally while enabling KD from the central server [125]. Park et al. [126] further investigate the integration of rehearsal strategies into federated CL. Due to the critical importance of privacy in federated settings, they introduce variational embeddings to encode and transmit task-relevant data to the server. These embeddings are then used for server-side training, allowing the system to replay past knowledge and mitigate forgetting.

4. Theoretical Foundations of CL

CL aims to develop models that can acquire new knowledge from sequential data while preserving previously learned information. Its theoretical foundations are centered on four closely related issues: the stability-plasticity dilemma, catastrophic forgetting, forward and backward transfer, and representation learning. These concepts explain why learning continuously is difficult and why different CL methods attempt to regulate parameter updates, preserve useful representations, or reuse prior knowledge. Table 9 summarizes the major theoretical foundations of CL.

Stability-Plasticity Dilemma

Building upon the formulation introduced in Section 2, consider a model with parameters $\theta \in \mathbb{R}^{|\theta|}$ trained on a sequence of $k$ tasks. For each task $t = 1, \dots, k$, the training set is defined as
$$\mathcal{D}_t = \{X_t, Y_t\} = \{(x_{t,n}, y_{t,n})\}_{n=1}^{N_t},$$
where $N_t$ denotes the number of samples in task $t$. The objective is to learn from the sequence
$$\mathcal{D}_{1:k} := \{\mathcal{D}_1, \dots, \mathcal{D}_k\}$$
while maintaining strong performance across both previous and current tasks. Assuming conditional independence across tasks, the joint likelihood can be written as
$$p(\mathcal{D}_{1:k} \mid \theta) = \prod_{t=1}^{k} p(\mathcal{D}_t \mid \theta).$$
For discriminative models, the log-likelihood of task $t$ is commonly expressed as
$$\log p(\mathcal{D}_t \mid \theta) = \sum_{n=1}^{N_t} \log p_\theta(y_{t,n} \mid x_{t,n}).$$
The main difficulty arises because, when learning a new task $\mathcal{D}_k$, data from previous tasks $\{\mathcal{D}_1, \dots, \mathcal{D}_{k-1}\}$ may be unavailable or only partially accessible. The model must therefore adapt to new information while preserving knowledge learned from earlier tasks. This tension is known as the stability-plasticity dilemma. Stability refers to the ability to retain previous knowledge, whereas plasticity refers to the ability to learn new information [127,128,129,130,131]. Excessive plasticity may cause catastrophic forgetting, while excessive stability may prevent adaptation to new tasks.
Figure 3. The stability-plasticity dilemma in CL. Excessive plasticity may overwrite prior knowledge, whereas excessive stability may restrict adaptation to new tasks.

Catastrophic Forgetting

Catastrophic forgetting is the degradation of performance on previously learned tasks after training on new data [132,133]. In neural networks, this problem occurs because gradient-based optimization updates shared parameters, which may overwrite representations that were important for earlier tasks. Forgetting is especially severe when tasks share overlapping parameters but require different decision boundaries or feature representations.
Most CL methods can be interpreted as attempts to reduce destructive interference. Regularization-based methods penalize changes to parameters that are important for previous tasks. Replay-based methods preserve earlier knowledge by revisiting stored or generated samples from previous tasks. Architecture-based methods reduce interference by allocating task-specific parameters or expandable modules. Although these strategies differ technically, they share the same goal: preserving useful prior knowledge while enabling adaptation to new data.

Forward and Backward Transfer

CL is not only concerned with avoiding forgetting. A strong CL system should also support transfer across tasks. Forward transfer occurs when knowledge learned from previous tasks improves learning on future tasks. For example, representations learned for object recognition may accelerate learning of related categories. Backward transfer occurs when learning a new task improves performance on earlier tasks, indicating that the model has refined or reorganized previous knowledge in a useful way.
Positive transfer is desirable because it allows CL systems to reuse knowledge rather than treating each task independently. However, transfer can also be negative when learning one task interferes with another. Therefore, an effective CL method should maximize beneficial transfer while minimizing harmful interference.
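Forward and backward transfer are often quantified from a matrix of accuracies recorded after each training stage. The sketch below assumes one widely used convention (averaging accuracy differences over tasks); this section does not itself fix a definition, and the names `acc`, `baseline`, `backward_transfer`, and `forward_transfer`, as well as the numbers, are illustrative assumptions rather than reported results.

```python
def backward_transfer(acc):
    """Average change on earlier tasks after all training is finished.

    acc[i][j] is the accuracy on task j after training through task i (0-indexed).
    Negative values indicate forgetting; positive values indicate that later
    learning improved earlier tasks.
    """
    T = len(acc)
    return sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)


def forward_transfer(acc, baseline):
    """Average gain on a task before training on it, relative to a baseline.

    baseline[j] is the accuracy of an untrained reference model on task j.
    """
    T = len(acc)
    return sum(acc[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)


# Hypothetical accuracies for three tasks (illustrative only).
acc = [
    [0.90, 0.20, 0.10],
    [0.70, 0.85, 0.30],
    [0.60, 0.75, 0.80],
]
baseline = [0.10, 0.10, 0.10]

bwt = backward_transfer(acc)            # about -0.20: earlier tasks degraded
fwt = forward_transfer(acc, baseline)   # about  0.15: prior tasks helped future ones
```

In this toy matrix the model shows negative backward transfer and positive forward transfer at the same time, which illustrates why the two quantities are reported separately.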

Representation Learning

Representation learning plays a central role in CL because the quality of learned features strongly affects both forgetting and transfer. If a model learns task-specific representations that are highly dependent on one dataset or task distribution, it may struggle to generalize to future tasks. In contrast, stable and reusable representations can reduce interference and improve adaptation.
Recent CL research increasingly emphasizes task-agnostic, domain-invariant, and transferable representations. Self-supervised learning and contrastive learning are often used to learn features that remain useful across different tasks and domains. Such representations can improve forward transfer and reduce the need for large replay buffers, although they may still require additional mechanisms when task distributions are highly heterogeneous.

Neuroscientific Motivation

CL is partly inspired by human and biological learning systems, which can acquire new knowledge without rapidly erasing prior memories [40,41,134]. Several biological mechanisms have influenced CL research. Synaptic consolidation preserves important memories by stabilizing relevant neural connections, inspiring methods such as Elastic Weight Consolidation (EWC), which penalizes changes to parameters important for earlier tasks [135,136,137,138]. Neurogenesis motivates expandable architectures that allocate new capacity for new tasks. Rehearsal-based memory mechanisms inspire replay methods, where previous samples or generated approximations are revisited during training.
These biological analogies should not be treated as direct implementations of human learning. Rather, they provide useful conceptual guidance for designing artificial systems that balance retention, adaptation, and efficient memory use.
In summary, the theoretical foundations of CL revolve around the need to balance stability and plasticity, reduce catastrophic forgetting, promote useful knowledge transfer, and learn reusable representations. These principles provide the basis for the major methodological families discussed in later sections, including regularization, replay, architecture-based adaptation, representation learning, and parameter-efficient continual adaptation.

Mathematical Frameworks

The theoretical understanding of CL is further supported by formal mathematical models that offer precise descriptions of its mechanisms and challenges.

Regularization-Based Models

These models introduce constraints during optimization to prevent large changes to parameters that are critical for previous tasks. For instance, EWC adds a quadratic penalty to the loss function for parameters identified as important for earlier tasks, promoting stability while leaving less important parameters free to adapt.
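The quadratic penalty can be made concrete with a small sketch. It assumes a diagonal importance estimate (for example, a diagonal Fisher approximation) and a scalar strength `lam`; the function names and numeric values are hypothetical and chosen only for illustration.

```python
def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """Quadratic penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher_diag)
    )


def ewc_penalty_grad(theta, theta_star, fisher_diag, lam):
    """Gradient of the penalty: lam * F_i * (theta_i - theta_star_i)."""
    return [lam * f * (t - s) for t, s, f in zip(theta, theta_star, fisher_diag)]


theta_star = [1.0, -2.0]    # parameters after the earlier task
fisher_diag = [5.0, 0.01]   # the first parameter is estimated to matter far more
theta = [1.5, -1.5]         # candidate parameters while learning the new task

penalty = ewc_penalty(theta, theta_star, fisher_diag, lam=1.0)
grad = ewc_penalty_grad(theta, theta_star, fisher_diag, lam=1.0)
```

Both parameters moved by the same amount, yet the pull back toward the old value is 500 times stronger for the first one. This is how the penalty concentrates stability on important parameters and leaves the rest comparatively plastic.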

Replay and Memory Models

Replay-based methods integrate past and current data to retain earlier knowledge while learning new tasks. For example, Experience Replay minimizes the combined loss from new task data and stored samples from previous tasks, balancing plasticity and stability. Synthetic replay techniques recreate data distributions from previous tasks using generative models.
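As a sketch, the combined loss can be written as follows, where $B_k$ is a mini-batch from the current task, $B_{\mathcal{M}}$ is a mini-batch drawn from the memory of stored samples, $\ell$ is a per-sample loss, $f_\theta$ is the model, and $\alpha > 0$ weights the replay term. The symbols $B_k$, $B_{\mathcal{M}}$, and $\alpha$ are notation introduced here for illustration; individual methods differ in how they weight and sample the two terms.

```latex
\mathcal{L}(\theta)
  = \frac{1}{|B_k|} \sum_{(x, y) \in B_k} \ell\big(f_\theta(x), y\big)
  + \alpha \, \frac{1}{|B_{\mathcal{M}}|} \sum_{(x, y) \in B_{\mathcal{M}}} \ell\big(f_\theta(x), y\big)
```

The first term drives plasticity on new data, and the second term anchors the model to earlier tasks. In generative replay, $B_{\mathcal{M}}$ is sampled from a generative model instead of a stored buffer.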

Dynamic Architectures

Dynamic approaches adapt the model’s structure to accommodate new tasks without overwriting prior knowledge. For example, Progressive Neural Networks add new parameters for each task while keeping the parameters learned for prior tasks frozen, ensuring stability.

Bayesian Models

Bayesian frameworks incorporate uncertainty into parameter updates, balancing the importance of old and new knowledge. For example, Variational CL uses Bayesian inference to estimate parameter importance and preserve previously learned distributions.
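Under the conditional-independence assumption used in the stability-plasticity formulation, the Bayesian view can be sketched as a recursion in which the posterior after the first $k-1$ tasks acts as the prior for task $k$. The prior $p(\theta)$, the approximating family $\mathcal{Q}$, and the normalizer $Z_k$ are notation assumed here for illustration.

```latex
p(\theta \mid \mathcal{D}_{1:k})
  \;\propto\; p(\theta) \prod_{t=1}^{k} p(\mathcal{D}_t \mid \theta)
  \;\propto\; p(\theta \mid \mathcal{D}_{1:k-1}) \, p(\mathcal{D}_k \mid \theta)

q_k(\theta)
  = \arg\min_{q \in \mathcal{Q}}
    \mathrm{KL}\!\left( q(\theta) \,\Big\|\, \tfrac{1}{Z_k}\, q_{k-1}(\theta)\, p(\mathcal{D}_k \mid \theta) \right)
```

The first line is exact but generally intractable for neural networks. The second line shows the variational approximation, in which each tractable posterior $q_k$ is fitted using only the previous approximation $q_{k-1}$ and the current task data, so earlier data need not be revisited.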

Information-Theoretic Models

These models quantify the trade-off between stability and plasticity using information theory. For instance, methods based on mutual information control how much knowledge from prior tasks is retained while preserving adaptability to new tasks.
Overall, the theoretical foundations of CL integrate key concepts, biological inspirations, and mathematical models to address the dual challenge of retaining past knowledge and acquiring new knowledge. By leveraging insights from neuroscience, cognitive science, and formal mathematical frameworks, researchers continue to develop algorithms that strike an optimal balance between stability and plasticity, paving the way for robust and adaptable AI systems. These principles form the cornerstone of ongoing advancements in the field of CL.

5. The Catastrophic Forgetting Problem

Catastrophic forgetting, also known as catastrophic interference, is a significant challenge in neural networks, particularly in CL scenarios [132,139]. It refers to the dramatic loss of performance on previously learned tasks when the model is trained on new tasks. This problem arises because of the way neural networks learn and update their parameters, creating conflicts between the objectives of retaining past knowledge and learning new information. Table 10 presents the overview of catastrophic forgetting.

Why Neural Networks Forget

Neural networks are typically trained using gradient-based optimization techniques, such as stochastic gradient descent. During training, the network adjusts its weights and biases to minimize a loss function that quantifies the model’s error on the current task. These updates are global, meaning they affect all parameters of the network, regardless of their relevance to previously learned tasks. When a new task is introduced, the model updates its weights to optimize performance on the new task. However, this optimization does not explicitly account for the importance of certain weights to earlier tasks [140]. Consequently, the new task’s learning process can overwrite or “interfere” with the representations encoded in the network for prior tasks. This phenomenon is often referred to as parameter drift, where the values of critical parameters for earlier tasks shift during the learning of new tasks.

Weight Updates and Parameter Drift

The neural network’s capacity to learn and retain knowledge is directly tied to its parameters (weights and biases). For each task, certain parameters become more significant in capturing its patterns and features. For instance, Task A might rely heavily on a subset of parameters to classify images of animals, while Task B, introduced later, might require modifying the same subset of parameters to classify images of vehicles. In the absence of mechanisms to preserve the importance of parameters for Task A, updates during Task B’s training overwrite the information stored in those parameters. This leads to a degradation in the network’s ability to perform Task A, which manifests as catastrophic forgetting. Parameter drift occurs because standard gradient-based optimization lacks constraints to differentiate between parameters that are crucial for previously learned tasks and those that can be safely modified for new learning. Without explicit mechanisms to mitigate this drift, the network effectively “forgets” earlier tasks as it learns new ones.
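Parameter drift can be reproduced in a deliberately tiny setting. The sketch below assumes a one-parameter regression model trained sequentially with plain SGD on two conflicting toy tasks; it stands in for the Task A and Task B scenario only by analogy, and all names and values are hypothetical.

```python
def sgd_fit(w, data, lr=0.05, epochs=200):
    """Plain SGD on squared error for the one-parameter model y_hat = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of y_hat = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


xs = [0.5, 1.0, 1.5, 2.0]
task_a = [(x, 2.0 * x) for x in xs]    # Task A: y = 2x
task_b = [(x, -1.0 * x) for x in xs]   # Task B: y = -x (same inputs, conflicting targets)

w = sgd_fit(0.0, task_a)
loss_a_before = mse(w, task_a)   # close to zero: Task A has been learned

w = sgd_fit(w, task_b)           # nothing ties w to the value Task A needed
loss_a_after = mse(w, task_a)    # large: the shared parameter drifted toward -1
```

Because the only parameter is shared and the update rule carries no information about Task A, training on Task B moves `w` from roughly 2 to roughly -1 and the Task A error rises sharply. Larger networks exhibit the same mechanism across many shared parameters.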

5.1. Factors Exacerbating Catastrophic Forgetting

1. Overlapping Representations: Neural networks often use overlapping representations for different tasks, meaning that the same subset of neurons and parameters is reused across tasks. While this enables compact and efficient learning, it also increases the likelihood of interference, as updates for one task can disrupt representations needed for another [141].
2. Sequential data access: In CL, data from previous tasks is typically unavailable during the training of new tasks due to memory constraints or privacy concerns. This sequential access to data makes it difficult for the model to revisit and reinforce earlier learning, exacerbating forgetting [141].
3. Lack of task awareness: In class-incremental and domain-incremental learning, the model is not explicitly informed about the task identity during inference. This forces the network to integrate knowledge across tasks, increasing the risk of overwriting previously learned information [11].

5.2. Mitigation Strategies

Researchers have developed various strategies to address catastrophic forgetting. Table 11 presents the summary of mitigating strategies.

6. Method Taxonomy in CL

CL methods are designed to mitigate catastrophic forgetting while enabling models to acquire new knowledge from sequential tasks. Existing approaches address the stability-plasticity trade-off through different mechanisms, including constraining parameter updates, replaying past information, expanding or isolating model components, modifying optimization dynamics, learning transferable representations, and adapting large pretrained models through parameter-efficient mechanisms. These methods are not mutually exclusive; many recent approaches combine replay, regularization, distillation, representation learning, and architectural adaptation to improve performance across different CL settings.
Figure 4. Overview of representative CL strategies. Replay-based methods revisit stored or generated samples from previous tasks. Regularization-based methods constrain changes to parameters or functions that are important for prior knowledge. Architecture-based methods allocate task-specific components or expand model capacity to reduce interference across tasks.

Regularization-Based Methods

Regularization-based methods mitigate catastrophic forgetting by constraining changes to parameters or model outputs that are important for previously learned tasks [20,118,142]. These methods are attractive because they usually do not require storing large replay buffers, making them suitable for memory-limited or privacy-sensitive settings. They can be broadly divided into weight regularization and function regularization.
Weight regularization constrains changes in network parameters. A common strategy is to add a penalty term to the loss function that discourages updates to parameters estimated to be important for previous tasks. Elastic Weight Consolidation (EWC) uses the Fisher Information Matrix (FIM) to estimate parameter importance [14], while later variants improve scalability or approximation quality [143,144]. Function regularization, in contrast, preserves the behavior of the previous model by aligning intermediate features or output distributions. This is commonly implemented through knowledge distillation (KD), where the previous model acts as a teacher and the current model is trained to retain its functional behavior [145]. Because previous data are often unavailable in CL, distillation may rely on current samples [146,147,148], limited old samples [149,150,151], external unlabeled data, or synthetic samples [152,153].
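As an illustration of function regularization, the following is a minimal sketch of a distillation term, assuming that the previous and current models expose raw logits for the same batch. The function names and the temperature value are illustrative choices, not taken from any of the cited methods.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax on temperature-scaled logits."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over a batch of softened output distributions.

    teacher_logits come from the frozen previous model, student_logits from
    the model currently being trained (both of shape [batch, classes]).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards the logarithms against exact zeros
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(kl.mean())
```

In a CL training loop, a term of this kind would typically be added to the current-task loss with a weighting coefficient, so that the new model is pulled toward the old model's predictions on whatever samples are available.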
Figure 5. Regularization-based methods preserve previous knowledge by constraining either important parameters through weight regularization or model behavior through function regularization.

6.0.1. Elastic Weight Consolidation

EWC is inspired by synaptic consolidation and penalizes changes to parameters that are important for previous tasks [143,144,154]. It estimates parameter importance using the FIM [155]. The EWC objective is expressed as:
$$\mathcal{L}_{\text{EWC}} = \mathcal{L}_{\text{current}} + \lambda \sum_{i} F_i \left(\theta_i - \theta_i^{*}\right)^2,$$
where $\mathcal{L}_{\text{current}}$ is the loss on the current task, $\theta_i$ is the current parameter value, $\theta_i^{*}$ is the parameter value learned from previous tasks, $F_i$ is the Fisher importance estimate, and $\lambda$ controls the strength of regularization.
EWC is simple and effective when task boundaries are known, but its performance can degrade as the number of tasks increases. Storing task-specific Fisher estimates may also become costly. Several extensions address these limitations. Liu et al. [154] reduce errors from diagonal FIM approximations through parameter-space rotation. Ritter et al. [143] use a Kronecker-factored block diagonal Hessian approximation, while online EWC maintains a single running penalty instead of storing separate Fisher matrices for each task [144]. Incremental Moment Matching follows a related Bayesian view by using the posterior of previous tasks as the prior for new tasks [156].
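The EWC penalty above can be sketched in a few lines. This is a minimal illustration that assumes a flat parameter vector and a diagonal empirical Fisher estimate computed from per-sample gradients of the log-likelihood; the function names are hypothetical.

```python
import numpy as np

def diagonal_fisher(per_sample_grads):
    """Diagonal empirical Fisher estimate.

    per_sample_grads has shape [n_samples, n_params] and holds gradients of
    the log-likelihood on previous-task data, evaluated at theta_star.
    """
    return np.mean(per_sample_grads ** 2, axis=0)

def ewc_penalty(theta, theta_star, fisher, lam):
    """lam * sum_i F_i * (theta_i - theta_star_i)^2, as in the objective above."""
    return float(lam * np.sum(fisher * (theta - theta_star) ** 2))

def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty, to be added to the current-task gradient."""
    return 2.0 * lam * fisher * (theta - theta_star)
```

Parameters with a Fisher value near zero are left free to move, while parameters with large Fisher values are pulled back toward their previous values. Storing one `fisher` and one `theta_star` per task is what makes the basic scheme grow with the number of tasks.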

6.0.2. Synaptic Intelligence and Related Methods

Synaptic Intelligence (SI) estimates parameter importance online by tracking each parameter’s contribution to loss reduction during training. Unlike EWC, SI does not rely directly on the Fisher matrix and can be applied in more flexible settings. The importance score is computed as:
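$$\omega_i = \sum_{t} \frac{\Delta\theta_i^{t} \cdot \Delta L^{t}}{\left(\Delta\theta_i^{t}\right)^2 + \epsilon},$$
where $\Delta\theta_i^{t}$ denotes the change in parameter $\theta_i$ during task $t$, $\Delta L^{t}$ is the corresponding change in loss, and $\epsilon$ avoids division by zero. The SI loss is:
$$\mathcal{L}_{\text{SI}} = \mathcal{L}_{\text{current}} + \lambda \sum_{i} \omega_i \left(\theta_i - \theta_i^{*}\right)^2.$$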
SI dynamically estimates importance during learning, but it requires tracking parameter contributions and may be less reliable when parameter importance is not strongly aligned with loss reduction. Riemannian Walk combines the FIM with the online path integral from SI to improve importance estimation [157]. Memory Aware Synapses (MAS) estimates parameter importance by measuring the sensitivity of the output function to parameter changes, making it applicable even with unlabeled data [158]. Benzing [155] shows that EWC, SI, and MAS are closely related through their use of Fisher-based importance estimation. Variational CL further frames continual adaptation as recursive Bayesian inference, where the posterior from previous tasks regularizes learning on new tasks [159].
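The online bookkeeping behind SI can be sketched as follows. This is a minimal illustration under stated assumptions: it uses plain gradient descent on a single task, and it follows the commonly described path-integral form, in which per-step contributions (negative gradient times parameter update) are accumulated and then normalized by the squared total displacement at the end of the task. This bookkeeping differs slightly from the compact expression above, and `grad_fn`, `lr`, `steps`, and `xi` are hypothetical names.

```python
import numpy as np

def si_importance(theta0, grad_fn, lr=0.1, steps=100, xi=1e-3):
    """Train with plain gradient descent and return (final_theta, importance).

    grad_fn(theta) must return the loss gradient at theta.
    """
    theta = theta0.astype(float).copy()
    path = np.zeros_like(theta)      # running per-parameter path integral
    for _ in range(steps):
        g = grad_fn(theta)
        delta = -lr * g              # parameter update for this step
        path += -g * delta           # first-order loss decrease credited to each parameter
        theta += delta
    total = theta - theta0           # net displacement over the whole task
    omega = path / (total ** 2 + xi) # xi avoids division by zero
    return theta, omega
```

A parameter that never receives gradient accumulates no path integral and ends with zero importance, so a later quadratic penalty weighted by `omega` leaves it free to change.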

Replay-Based Methods

Replay-based methods mitigate forgetting by revisiting information from previous tasks during training. They are among the most effective CL approaches, particularly in class-incremental learning. Replay can be implemented by storing raw samples, storing compressed representations, replaying features, or generating synthetic data that approximates previous task distributions.
Figure 6. Replay-based methods approximate previous task distributions through stored samples, generated samples, or feature-level replay. Experience replay retains representative examples, generative replay synthesizes old samples, and feature replay stores or reconstructs previous representations.

Experience Replay

Experience replay stores a subset of previous samples in a memory buffer and mixes them with current task data during training. The replay objective can be written as:
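$$\mathcal{L}_{\text{Replay}} = \alpha\, \mathcal{L}_{\text{current}} + (1 - \alpha)\, \mathcal{L}_{\text{replay}},$$
where $\mathcal{L}_{\text{current}}$ is the loss on new task data, $\mathcal{L}_{\text{replay}}$ is the loss on replayed data, and $\alpha$ balances current and past information.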
The main advantage of experience replay is that it directly preserves representative samples from previous tasks. However, its effectiveness depends strongly on memory size, sample selection, and replay scheduling. Storing raw data may also be infeasible in privacy-sensitive applications.
Early memory selection strategies include Reservoir Sampling [110,160,161], Ring Buffer [62], and mean-of-feature selection as used in iCaRL [72]. Other strategies use clustering, plane distance, or entropy-based selection [110,160]. More advanced methods select samples based on gradient diversity or optimization objectives, including GSS [142], CCBO [162], OCS [163], ASER [164], Rainbow Memory [165], and GCR [166].
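To make the simplest of these strategies concrete, the following is a minimal sketch of a reservoir-sampled memory buffer together with the weighted replay objective given earlier. The class and function names are hypothetical, and the buffer stores arbitrary Python objects, for example (input, label) pairs.

```python
import random

class ReservoirBuffer:
    """Fixed-size memory in which every item seen so far is kept with equal probability."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self._rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)               # fill phase: keep everything
        else:
            j = self._rng.randrange(self.n_seen)  # uniform index in [0, n_seen)
            if j < self.capacity:
                self.items[j] = item              # replace a stored item

    def sample(self, k):
        """Draw up to k stored items without replacement."""
        return self._rng.sample(self.items, min(k, len(self.items)))

def replay_loss(loss_current, loss_replay, alpha=0.5):
    """alpha * L_current + (1 - alpha) * L_replay."""
    return alpha * loss_current + (1.0 - alpha) * loss_replay
```

Reservoir sampling needs no task boundaries, which is one reason it is often used in online settings. A ring buffer or a herding-style selection would replace only the `add` method.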
Several works improve replay efficiency by compressing, augmenting, or editing memory samples. Adaptive Quantization Modules use vector-quantized compression for memory-efficient replay [118,167]. Memory replay with data compression models storage allocation using determinantal point processes [168,169]. Rainbow Memory increases diversity through augmentation [165], while Retrospective Adversarial Replay generates challenging samples near forgetting boundaries and applies MixUp [170,171]. Other approaches store auxiliary information such as dual memory statistics [172] or attention maps [173]. Memory samples can also be updated to become more representative or more challenging, as in Mnemonics [174] and Gradient-based Memory Editing [175].
Replay is also frequently combined with constrained optimization and knowledge distillation. GEM constrains updates so that losses on stored samples do not increase [62], while A-GEM improves efficiency by replacing task-specific constraints with a global replay constraint [176]. Meta-Experience Replay encourages gradient alignment between old and new samples [110], and later works explore task-gradient decomposition, saddle-point optimization, Pareto balancing, and selective replay [85,177,178,179,180].
In class-incremental learning, replay is often paired with distillation. iCaRL [72] and EEIL [149] combine exemplars with KD. LUCIR improves feature consistency and reduces classifier bias [151], while BiC [181], WA [182], and SS-IL [183] address class imbalance and bias. PODNet preserves spatial representations through distillation [150], Co2L uses self-supervised distillation [184], GeoDL aligns old and new feature spaces [185], and ELI uses energy-based alignment [186]. Other methods enhance distillation through uncertainty, feature-space structure, task attention, or dynamic expansion [187,188,189,190,191,192]. Weight regularization can also be combined with replay to improve stability [157,193].
Despite its strength, experience replay may overfit to the limited stored samples [194]. LiDER addresses this by enforcing Lipschitz continuity [195], while MOCA increases representation variability to prevent feature contraction [196]. Strong simple baselines such as DER/DER++ [197], X-DER [198], and GDumb [199] show that replay design remains a critical factor in fair CL evaluation.

Generative Replay

Generative replay reduces the need to store raw samples by training a generative model to synthesize data from previous tasks [73,74]. Synthetic samples are combined with current task data during training, allowing the model to rehearse earlier distributions without explicit access to old data.
Generative replay is appealing for privacy-sensitive and memory-constrained settings, but it introduces additional computational cost and depends heavily on the quality of generated samples. Poor generative models may produce biased or low-diversity samples, which can weaken retention. GAN-based methods often generate high-quality samples but may suffer from label inconsistency or mode collapse [200,201]. Autoencoder-based methods offer more explicit label control but may produce less detailed samples, as seen in FearNet [202], SRM [203], CLEER [204], EEC [200], GMR [205], and Flashcards [206]. Hybrid approaches such as L-VAEGAN combine generative quality with more precise inference [207].
DGR provides a foundational generative replay framework by replaying samples from a previous generator while learning new tasks [73]. MeRGAN improves consistency through replay alignment [152]. Generative replay can also be combined with weight regularization [51,159,208], experience replay [51,209], masking and expandable architectures [201], and pretrained feature statistics [202,210,211]. Because full data generation remains expensive, feature replay has emerged as a lighter alternative. GFR replays generated features after the feature extractor [212], and BI-R replays internal representations using context-modulated feedback connections [74]. Large-scale pretraining can further stabilize feature representations for downstream CL [213].
Table 12. Comparison between experience replay and generative replay.

| Aspect | Experience Replay | Generative Replay |
| --- | --- | --- |
| Memory usage | Stores selected raw samples or compressed examples. | Stores a generative model that synthesizes previous data. |
| Privacy | May be problematic when previous data are sensitive. | Avoids direct storage of raw samples but may still leak information if not properly controlled. |
| Replay quality | High fidelity because original samples are replayed. | Depends on generator quality, diversity, and label consistency. |
| Computational cost | Relatively low compared with training a generator. | Higher due to training and maintaining a generative model. |
| Best suited for | Class-incremental learning and reinforcement learning when memory is available. | Privacy-sensitive or memory-constrained settings where raw data cannot be stored. |

Architecture-Based Methods

Architecture-based methods reduce forgetting by modifying model structure. They allocate task-specific parameters, subnetworks, masks, or expandable modules to reduce interference between tasks. These approaches are effective when task boundaries are known because the model can activate task-specific components during training and inference.
Progressive Neural Networks, dynamically expandable networks, PackNet-like pruning strategies, and modular expert-based models are representative examples. Their major strength is strong knowledge preservation through parameter isolation. However, their main weakness is scalability: as the number of tasks increases, model size and computational cost may grow substantially. Therefore, architecture-based methods are best suited for task-incremental settings or applications where a moderate number of clearly defined tasks is expected.
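Parameter isolation and its capacity limit can be illustrated with a small sketch. This is not the procedure of any specific cited method: it assumes a single flat weight vector, assigns disjoint parameter subsets to tasks in a PackNet-like fashion, and uses hypothetical names throughout.

```python
import numpy as np

class MaskedParameters:
    """Flat weight vector in which each parameter is owned by at most one task."""

    def __init__(self, n_params):
        self.weights = np.zeros(n_params)
        self.owner = np.full(n_params, -1)   # -1 marks a parameter as still free

    def claim(self, task_id, n_claim):
        """Reserve n_claim free parameters for task_id."""
        free = np.flatnonzero(self.owner == -1)
        if n_claim > free.size:
            raise ValueError("capacity exhausted: no free parameters left")
        self.owner[free[:n_claim]] = task_id

    def apply_gradient(self, task_id, grad, lr=0.1):
        """Update only the parameters owned by task_id; all others stay frozen."""
        trainable = self.owner == task_id
        self.weights[trainable] -= lr * grad[trainable]

    def task_mask(self, task_id):
        """Parameters active for task_id: its own plus those frozen by earlier tasks."""
        return (self.owner >= 0) & (self.owner <= task_id)
```

Because earlier parameters are never updated, forgetting is avoided by construction. The `ValueError` branch reflects the scalability limitation noted above: a fixed-size model eventually runs out of free parameters, and an expanding model grows with the number of tasks.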

Optimization-Based Methods

Optimization-based methods directly control gradient updates to reduce interference between old and new tasks. Rather than storing knowledge only through parameters or samples, these methods modify the optimization trajectory so that learning new tasks does not substantially harm previous tasks. GEM and A-GEM are common examples because they project or constrain gradients using replay memory [62,176]. Other methods encourage gradient alignment, reduce conflicting gradients, or balance stability and plasticity through multi-objective optimization [85,110,178].
These methods provide principled mechanisms for reducing destructive interference, but they may introduce computational overhead due to gradient storage, projection, or constraint solving. Their effectiveness also depends on the quality and representativeness of replay samples.
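The gradient projection used by A-GEM-style methods can be sketched as follows. This is a minimal illustration that assumes flattened gradient vectors, where `grad_ref` (a hypothetical name) is the gradient of the loss on a batch drawn from replay memory.

```python
import numpy as np

def agem_project(grad, grad_ref):
    """Return a gradient that does not increase the loss on the replay batch (to first order).

    If the proposed gradient conflicts with the reference gradient (negative
    dot product), remove its component along grad_ref; otherwise keep it.
    """
    dot = float(np.dot(grad, grad_ref))
    if dot >= 0.0:
        return grad.copy()
    return grad - (dot / float(np.dot(grad_ref, grad_ref))) * grad_ref
```

After projection, the inner product with `grad_ref` is zero in the conflicting case, so the update is orthogonal to the direction that would harm the replayed samples. The overhead mentioned above comes from computing and storing `grad_ref` at every step.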

Representation-Learning Methods

Representation-learning methods aim to learn features that remain stable and transferable across tasks. Instead of only protecting parameters or replaying data, these approaches improve the quality of the feature space so that future tasks can be learned with less interference. Self-supervised learning, contrastive learning, feature disentanglement, and pretrained representations are increasingly used for this purpose [29,54,109].
Strong representations can improve forward transfer and reduce the need for large replay buffers. However, they are not sufficient by themselves when task distributions are highly heterogeneous or when new tasks require substantially different decision boundaries. Therefore, representation learning is often combined with replay, distillation, or regularization.

Parameter-Efficient and Prompt-Based Methods

With the increasing use of large pretrained models, parameter-efficient and prompt-based CL methods have become important. Instead of updating the entire model, these methods adapt only a small number of parameters, such as prompts, adapters, prefixes, or low-rank modules. This reduces computational cost and limits interference with pretrained representations.
Prompt-based methods such as L2P, DualPrompt, and related approaches use learnable prompts to guide task adaptation while keeping most backbone parameters fixed [214,215,216,217]. Parameter-efficient transfer learning methods from NLP also provide useful tools for CL because they allow large models to incorporate new knowledge without full fine-tuning. These approaches are especially promising for foundation models, multimodal systems, and large-scale deployment scenarios. However, prompt selection, prompt interference, adapter growth, and task identity uncertainty remain open problems.
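As one example of a low-rank module, the following sketch wraps a frozen weight matrix with a trainable low-rank update in the style of LoRA. It is a simplified illustration that assumes a single linear layer without bias or scaling factor; the class name and initialization scale are illustrative choices.

```python
import numpy as np

class LowRankAdapter:
    """Linear layer y = x W^T + x A^T B^T with W frozen and only A, B trainable."""

    def __init__(self, frozen_weight, rank, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = frozen_weight.shape
        self.W = frozen_weight                            # pretrained weight, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in)) # small random init
        self.B = np.zeros((d_out, rank))                  # zero init: adapter starts as a no-op

    def forward(self, x):
        return x @ self.W.T + (x @ self.A.T) @ self.B.T

    def n_trainable(self):
        return self.A.size + self.B.size
```

With rank `r`, the number of trainable values is `r * (d_in + d_out)` instead of `d_in * d_out`. In a continual setting one adapter could be kept per task, which is where the adapter growth and task identity issues mentioned above arise.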
In summary, CL methods differ in how they preserve previous knowledge and support adaptation. Regularization-based methods constrain parameter or function changes, replay-based methods revisit previous information, architecture-based methods isolate or expand capacity, optimization-based methods control gradient interference, representation-learning methods improve feature transferability, and parameter-efficient methods adapt large pretrained models with limited updates. In practice, the strongest CL systems often combine several of these strategies to balance accuracy, memory efficiency, computational cost, scalability, and robustness.

7. Evaluation Protocols, Benchmarks, and Metrics in CL

A major challenge in CL research is the lack of standardized evaluation protocols and benchmark settings. Although numerous methods have been proposed to mitigate catastrophic forgetting, fair comparison across studies remains difficult due to variations in task construction, dataset partitioning, memory budgets, evaluation metrics, and training protocols. As a result, reported performance often depends not only on the effectiveness of the proposed method but also on the experimental setup itself. Establishing consistent and reproducible evaluation practices is therefore essential for accurately assessing CL systems.

Benchmark Datasets

Benchmark datasets play a central role in evaluating CL methods. Existing benchmarks can generally be categorized into image classification, object detection, video understanding, reinforcement learning, natural language processing, and medical imaging benchmarks. In image classification, commonly used benchmarks include Split-MNIST, Permuted-MNIST, Split CIFAR-10, Split CIFAR-100, TinyImageNet, and ImageNet-based incremental benchmarks. These datasets are frequently used in task-incremental, domain-incremental, and class-incremental learning scenarios due to their controllable task construction and broad adoption in prior studies.
More challenging and realistic benchmarks have also emerged in recent years. CORe50 introduces continuous object recognition under varying environmental conditions, while Stream-51 focuses on streaming video-based CL. CLEAR (Continual LEArning on Real-World Imagery) was proposed to evaluate CL under more realistic temporal and environmental variations. Such benchmarks aim to move beyond artificially segmented tasks toward real-world continual adaptation scenarios.
In medical imaging, CL benchmarks remain relatively limited despite increasing interest in lifelong medical AI systems. Existing studies commonly utilize datasets from segmentation and classification tasks, including retinal imaging, brain tumor segmentation, cardiac ultrasound, and chest X-ray analysis. However, evaluation protocols in medical CL are often inconsistent due to domain shifts across hospitals, imaging devices, and annotation standards.

CL Evaluation Settings

CL evaluation protocols typically differ according to the underlying learning scenario. The three most widely studied settings are task-incremental learning (TIL), domain-incremental learning (DIL), and class-incremental learning (CIL). In TIL, task identity is available during inference, allowing the model to utilize task-specific components or output heads. This setting is generally considered less challenging because the model is informed about the current task context.
In DIL, the task objective remains unchanged while the input distribution changes over time. The model must adapt to domain shifts without explicit task identity information during inference. CIL represents the most challenging and practically relevant setting. In this scenario, new classes are introduced sequentially, and the model must classify samples across all previously encountered classes without access to task identity during inference. This setting closely resembles real-world deployment conditions where task boundaries are often unavailable.
Recent research has also explored online CL, streaming CL, multimodal CL, and federated CL settings. These protocols introduce additional constraints such as single-pass training, privacy preservation, communication efficiency, and cross-modal adaptation.

Task Construction and Data Splits

Task construction significantly influences CL performance. Different studies often use distinct task partitioning strategies even when evaluating on the same dataset, making direct comparison difficult. For example, CIFAR-100 may be divided into 10 tasks with 10 classes each, 20 tasks with 5 classes each, or arbitrary task groupings depending on the experimental design. Similarly, the order of tasks can substantially affect forgetting behavior and knowledge transfer. Some studies use fixed task orders, whereas others employ random task permutations across multiple runs. Unfortunately, many works report results using only a single task ordering, which may introduce evaluation bias. Another important issue involves base initialization protocols. Some methods begin with a large base task followed by incremental updates, whereas others use fully balanced sequential tasks. These choices directly influence feature stability, representation quality, and replay effectiveness. To improve reproducibility, recent works increasingly recommend reporting task construction details explicitly, evaluating across multiple random seeds, testing multiple task orders, and using consistent train-validation-test splits.
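The reporting recommendations above are easier to follow when task construction is generated by a small, seeded function. The sketch below is one possible way to do this, assuming integer class labels; the function name and the `base_classes` option are hypothetical.

```python
import numpy as np

def make_class_splits(n_classes, n_tasks, seed, base_classes=None):
    """Return a list of class-id lists, one per task, from a seeded class order.

    With base_classes=None all tasks are balanced. Otherwise the first task
    receives base_classes classes and the rest are split evenly.
    """
    order = np.random.default_rng(seed).permutation(n_classes)
    if base_classes is None:
        if n_classes % n_tasks != 0:
            raise ValueError("n_classes must be divisible by n_tasks")
        return [chunk.tolist() for chunk in np.split(order, n_tasks)]
    remaining = n_classes - base_classes
    if n_tasks < 2 or remaining <= 0 or remaining % (n_tasks - 1) != 0:
        raise ValueError("incremental classes must split evenly across n_tasks - 1 tasks")
    rest = np.split(order[base_classes:], n_tasks - 1)
    return [order[:base_classes].tolist()] + [chunk.tolist() for chunk in rest]
```

For a 100-class dataset such as CIFAR-100, `make_class_splits(100, 10, seed=s)` gives the 10-task protocol, `make_class_splits(100, 20, seed=s)` the 20-task protocol, and `base_classes=50` a large-base-task protocol. Looping over several values of `s` covers multiple class orders.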

Evaluation Metrics

Several evaluation metrics have been proposed to measure CL performance. However, no single metric fully captures all aspects of continual adaptation.

7.0.1. Average Accuracy

Average accuracy is one of the most widely used metrics in CL. After learning task T, the average accuracy is computed as:
$$A_T = \frac{1}{T} \sum_{i=1}^{T} a_{T,i},$$
where $a_{T,i}$ denotes the test accuracy on task $i$ after learning task $T$.
Although average accuracy provides a general measure of overall performance, it does not explicitly quantify forgetting or transfer behavior.

Forgetting Measure

Forgetting evaluates how much performance degrades on previous tasks after learning new tasks. A commonly used formulation is:
$$F_T = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( \max_{l \in \{1, \dots, T-1\}} a_{l,i} - a_{T,i} \right)$$
Lower forgetting values indicate better knowledge retention.
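Both metrics above can be computed from a single accuracy matrix. The sketch below assumes a square NumPy array `acc` in which `acc[l, i]` is the test accuracy on task `i` after training on task `l` (zero-based indices); the function names are illustrative.

```python
import numpy as np

def average_accuracy(acc):
    """Mean accuracy over all tasks after the final task has been learned."""
    T = acc.shape[0]
    return float(acc[T - 1, :T].mean())

def forgetting(acc):
    """Mean drop from the best earlier accuracy to the final accuracy, over tasks 1..T-1."""
    T = acc.shape[0]
    if T < 2:
        raise ValueError("forgetting needs at least two tasks")
    best_earlier = acc[:T - 1, :T - 1].max(axis=0)  # max over l in {1, ..., T-1}
    return float((best_earlier - acc[T - 1, :T - 1]).mean())
```

For instance, with three tasks and rows `[0.9, 0.1, 0.1]`, `[0.7, 0.8, 0.1]`, `[0.6, 0.7, 0.9]`, the final row averages to about 0.733, and the forgetting is ((0.9 - 0.6) + (0.8 - 0.7)) / 2 = 0.2.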

Forward Transfer

Forward transfer measures whether previously acquired knowledge improves learning efficiency on future tasks. Positive forward transfer indicates that earlier representations facilitate adaptation to new tasks.

Backward Transfer

Backward transfer evaluates whether learning new tasks improves performance on earlier tasks. Positive backward transfer reflects beneficial knowledge integration across tasks, whereas negative backward transfer indicates forgetting.
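The text does not give formulas for the two transfer metrics. One widely used formalization, introduced alongside GEM [62], defines backward transfer as the mean of $a_{T,i} - a_{i,i}$ over $i = 1, \dots, T-1$, and forward transfer as the mean of $a_{i-1,i} - b_i$ over $i = 2, \dots, T$, where $b_i$ is the accuracy of a randomly initialized model on task $i$. The sketch below assumes the same accuracy-matrix convention as the average accuracy and forgetting formulas (`acc[l, i]` is accuracy on task `i` after training on task `l`, zero-based), and `baseline` is a hypothetical array holding the $b_i$ values.

```python
import numpy as np

def backward_transfer(acc):
    """Mean change on earlier tasks between right after learning them and the end."""
    T = acc.shape[0]
    just_learned = np.diag(acc)[:T - 1]          # a_{i,i} for the first T-1 tasks
    return float((acc[T - 1, :T - 1] - just_learned).mean())

def forward_transfer(acc, baseline):
    """Mean gain on each task before it is trained, relative to a random-init baseline."""
    T = acc.shape[0]
    before_training = np.array([acc[i - 1, i] for i in range(1, T)])
    return float((before_training - baseline[1:T]).mean())
```

A negative `backward_transfer` value corresponds to forgetting, while a positive `forward_transfer` value indicates that earlier tasks helped on tasks not yet trained.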

Memory and Computational Efficiency

In replay-based methods, memory budget is a critical evaluation factor. Some methods achieve high performance by storing large replay buffers, making comparisons unfair when memory constraints differ substantially across studies. Computational efficiency is also increasingly important, particularly for large-scale transformer-based and foundation-model CL systems. Metrics such as training time, parameter growth, inference latency, and energy consumption are becoming relevant for real-world deployment.

Challenges in CL Evaluation

Despite substantial progress, CL evaluation remains fragmented and inconsistent. Several key issues continue to hinder fair comparison across methods. These include different task splits and benchmark configurations, inconsistent replay memory budgets, variation in model initialization strategies, different validation protocols and hyperparameter tuning approaches, limited reporting of statistical variance across multiple runs, and heavy reliance on simplified academic benchmarks. Furthermore, many studies evaluate methods primarily on image classification datasets while neglecting more realistic settings such as multimodal learning, streaming environments, dense prediction tasks, and large-scale foundation-model adaptation. Another emerging challenge involves evaluating CL under realistic deployment constraints, including privacy preservation, domain shift, continual annotation updates, and resource-limited edge devices.

Toward Standardized Evaluation Protocols

To improve reproducibility and practical relevance, future CL research should prioritize standardized evaluation protocols and transparent reporting practices. This includes consistent benchmark construction, clearly defined task splits, fixed memory constraints, multi-seed evaluation, reporting of computational complexity, and broader adoption of realistic streaming benchmarks. In addition, future benchmarks should increasingly incorporate foundation models, multimodal data streams, domain shifts, and real-world deployment constraints. Such efforts are essential for transitioning CL from controlled academic experiments toward robust lifelong learning systems capable of operating effectively in dynamic environments.

8. Comparative Analysis of CL Methods

Continual learning (CL) methods have evolved substantially over the past decade, leading to a diverse set of strategies designed to mitigate catastrophic forgetting while enabling adaptation to new tasks and domains. Despite the rapid development of the field, no single method consistently outperforms others across all CL scenarios. Different approaches exhibit distinct strengths and limitations depending on the task setting, memory constraints, computational budget, and availability of task identity information. This section provides a comparative analysis of major CL method categories, focusing on their underlying principles, advantages, limitations, scalability, and suitability for practical deployment. Table 13 compares the major CL method categories.

Overview of CL Method Categories

Existing CL methods can generally be categorized into regularization-based methods, replay-based methods, architecture-based methods, optimization-based methods, representation-learning methods, and more recent parameter-efficient and prompt-based adaptation approaches. Although these categories are conceptually distinct, many modern approaches combine multiple strategies to improve stability and adaptability.
Regularization-based methods aim to preserve previously learned knowledge by constraining parameter updates during new task learning. Replay-based methods retain or reconstruct previous data distributions to reinforce earlier knowledge. Architecture-based methods dynamically expand or isolate network components to reduce interference between tasks. Optimization-based methods directly manipulate gradient updates to minimize forgetting. Representation-learning approaches focus on learning transferable and stable features across tasks. More recently, parameter-efficient and prompt-based methods have emerged as promising solutions for adapting large-scale foundation models under CL settings.

Regularization-Based Methods

Regularization-based methods are among the earliest and most widely studied CL approaches. These methods attempt to preserve previously acquired knowledge by penalizing changes to parameters that are considered important for earlier tasks. Representative approaches include Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), and Memory Aware Synapses (MAS). A major advantage of regularization-based methods is their relatively low memory overhead since they do not require storing large replay buffers. These methods are computationally efficient and relatively easy to integrate into existing training pipelines. Furthermore, they are well suited for privacy-sensitive applications where retaining previous training data may be infeasible.
However, regularization-based methods often struggle in highly non-stationary environments or long task sequences where task distributions differ substantially. Since they rely primarily on parameter constraints, their ability to preserve fine-grained task representations decreases as the number of tasks grows. Consequently, forgetting may accumulate gradually over time. Regularization approaches are generally more effective in task-incremental settings where task boundaries are relatively distinct and interference between tasks is moderate.

Replay-Based Methods

Replay-based methods mitigate forgetting by revisiting previously learned data during training. These methods either store a subset of past samples in memory buffers or generate synthetic replay samples using generative models. Representative approaches include Experience Replay (ER), Gradient Episodic Memory (GEM), iCaRL, Dark Experience Replay (DER), and generative replay methods. Replay-based strategies have demonstrated strong empirical performance across various CL benchmarks, particularly in class-incremental learning scenarios. By repeatedly exposing the model to previous data distributions, replay methods effectively stabilize feature representations and reduce catastrophic forgetting.
Despite their effectiveness, replay-based methods introduce several challenges. Storing previous samples increases memory requirements, which can become problematic in large-scale or privacy-sensitive applications. Moreover, replay performance depends heavily on memory selection strategies, replay buffer size, and sample diversity. Generative replay methods alleviate explicit storage requirements but often suffer from imperfect sample generation and increased computational complexity. Replay-based methods remain among the most effective approaches for class-incremental learning, especially when moderate memory budgets are available.

Architecture-Based Methods

Architecture-based methods address catastrophic forgetting by modifying the network structure itself. These approaches allocate dedicated parameters, subnetworks, or modules for different tasks, thereby reducing interference between old and new knowledge. Representative methods include Progressive Neural Networks, Dynamically Expandable Networks (DEN), PathNet, and expert-based modular architectures. One of the primary strengths of architecture-based methods is their ability to preserve previously learned representations with minimal forgetting. Since task-specific parameters are isolated, interference between tasks is significantly reduced. However, these methods often suffer from scalability limitations. As the number of tasks increases, network size and computational complexity may grow substantially. Furthermore, architecture expansion can become inefficient for long CL sequences or resource-constrained deployment environments. Architecture-based approaches are particularly effective in task-incremental settings where task identity is available during inference.

Optimization-Based Methods

Optimization-based methods attempt to reduce forgetting by directly controlling gradient updates during training. These approaches aim to prevent destructive interference between tasks by modifying optimization trajectories. Representative methods include Gradient Episodic Memory (GEM), Averaged GEM (A-GEM), Orthogonal Gradient Descent (OGD), and related gradient projection strategies. These methods provide a principled framework for balancing stability and plasticity during continual adaptation. By constraining gradient directions, optimization-based methods attempt to preserve previously learned knowledge while enabling efficient learning of new tasks.
Nevertheless, optimization-based approaches often incur substantial computational overhead due to gradient storage, projection operations, or optimization constraints. Their performance may also depend heavily on replay memory quality and gradient approximation strategies. Optimization-based methods are especially useful in scenarios where preserving task-specific gradient information is critical.

Representation Learning Approaches

Representation-learning methods focus on learning stable and transferable feature representations across tasks. Instead of solely preserving parameters or replaying data, these approaches attempt to disentangle task-specific and task-invariant representations. Recent advances in self-supervised learning and contrastive learning have significantly influenced representation-based CL. These methods aim to improve feature generalization while reducing sensitivity to task-specific distribution shifts. A key advantage of representation-learning approaches is their ability to improve forward transfer and domain generalization. Robust representations can facilitate adaptation to future tasks and reduce forgetting under moderate domain shifts. However, learning universally transferable representations remains challenging, particularly in highly heterogeneous task sequences. Representation-learning methods may still require replay mechanisms or auxiliary regularization strategies for long-term stability.

Prompt-Based and Parameter-Efficient CL

Recent years have witnessed increasing interest in applying CL to large-scale transformers and foundation models. In this context, parameter-efficient fine-tuning (PEFT) and prompt-based adaptation strategies have emerged as scalable alternatives to full model retraining. Representative methods include Learning to Prompt (L2P), DualPrompt, CODA-Prompt, adapter-based tuning, prefix tuning, and low-rank adaptation (LoRA). These approaches update only small subsets of model parameters while preserving the majority of pretrained weights.
Prompt-based CL methods offer several advantages. They reduce computational cost, improve scalability for large foundation models, and mitigate catastrophic forgetting by minimizing parameter interference. Furthermore, these methods are particularly suitable for multimodal and large language model (LLM) adaptation scenarios. Despite their promise, prompt-based methods remain relatively underexplored in realistic continual deployment settings. Several open challenges remain, including prompt interference, prompt scalability, long-term adaptation stability, and efficient prompt selection strategies.
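As an illustration of how such methods update only a small subset of parameters, the sketch below implements a LoRA-style linear layer in plain Python. The pretrained weight matrix is kept frozen, and only two low-rank factors would be trained. The class name LoRALinear, the initialization scale, and the toy dimensions are assumptions made for this example, not details taken from the methods surveyed above.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, x: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weights, never updated
        # A gets a small random init and B starts at zero, so the layer
        # initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * l for b, l in zip(base, low_rank)]

    def trainable_parameters(self) -> int:
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 16 x 16 identity matrix standing in for a pretrained weight.
dim = 16
frozen = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
layer = LoRALinear(frozen, rank=2)
full_parameters = dim * dim
print(layer.trainable_parameters(), "trainable of", full_parameters)
```

In a continual setting, one such pair of factors could be kept per task or merged over time. How to select, merge, or prune them is one of the open questions noted above.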

Comparison Across CL Settings

Different CL methods exhibit varying levels of effectiveness depending on the underlying CL scenario. In task-incremental learning, architecture-based and regularization-based methods often perform well because task identity information reduces ambiguity during inference. In domain-incremental learning, representation-learning approaches and replay-based methods tend to provide stronger robustness against domain shifts. In class-incremental learning, replay-based methods generally outperform other categories because the model must simultaneously distinguish between old and new classes without access to task identity information. Online and streaming CL scenarios introduce additional constraints, including single-pass learning, limited memory, and real-time adaptation requirements. Under such settings, lightweight replay strategies and parameter-efficient adaptation methods become increasingly important.
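For online and streaming scenarios, a lightweight replay strategy must maintain a fixed-size memory while seeing each sample only once. One common choice, used here purely as an illustration, is reservoir sampling, which keeps every sample seen so far in the buffer with equal probability. The class name ReservoirBuffer and its interface are assumptions made for this sketch.

```python
import random
from typing import Any, List


class ReservoirBuffer:
    """Fixed-capacity replay memory filled by reservoir sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0  # number of stream samples observed so far
        self.rng = random.Random(seed)

    def add(self, item: Any) -> None:
        """Offer one stream sample to the buffer (single pass, O(1) work)."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Pick a slot uniformly from all samples seen so far; the new sample
        # replaces a stored one only if that slot falls inside the buffer.
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

    def sample(self, k: int) -> List[Any]:
        """Draw up to k stored samples for replay alongside new data."""
        return self.rng.sample(self.items, min(k, len(self.items)))


# Hypothetical stream of 1000 samples and a memory budget of 50.
buffer = ReservoirBuffer(capacity=50)
for sample_id in range(1000):
    buffer.add(sample_id)
replay_batch = buffer.sample(8)
print(len(buffer.items), buffer.seen, len(replay_batch))
```

The memory cost is bounded by the capacity regardless of stream length, which is the property that matters under the single-pass and limited-memory constraints described above.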

Memory, Scalability, and Computational Trade-Offs

One of the central challenges in CL involves balancing performance with memory and computational efficiency. Replay-based methods often achieve strong retention performance but require explicit memory buffers. Architecture-based methods reduce forgetting effectively but may scale poorly due to parameter growth. Regularization-based approaches remain memory efficient but may struggle in highly dynamic environments. Prompt-based and PEFT approaches represent a promising compromise for large-scale continual adaptation because they reduce computational overhead while preserving pretrained representations. However, their long-term scalability and robustness under severe distribution shifts remain active research topics. Consequently, selecting an appropriate CL strategy depends heavily on the target application, deployment constraints, and task characteristics.
The comparative analysis reveals that CL remains fundamentally a trade-off optimization problem involving stability, plasticity, scalability, memory efficiency, and computational cost. Existing methods often excel in specific scenarios but fail to generalize universally across diverse CL environments. Future research should increasingly focus on unified CL frameworks capable of handling realistic streaming conditions, multimodal inputs, large-scale foundation models, domain shifts, and deployment-oriented constraints. In addition, greater emphasis should be placed on standardized evaluation protocols, reproducibility, and practical scalability to bridge the gap between academic benchmarks and real-world continual adaptation systems.

9. Applications of CL

CL has broad potential in domains where systems must learn and adapt to new information over time without forgetting prior knowledge. This section discusses its applications in healthcare and medical imaging, robotics and autonomous systems, NLP, recommender systems, and cybersecurity. Table 14 summarizes the key applications of CL.

Applications in Healthcare and Medical Imaging

The healthcare domain stands to benefit substantially from CL, particularly in medical imaging, where models must cope with data heterogeneity, evolving diagnostic criteria, and the need for personalized treatment approaches [218]. Medical imaging datasets vary in image acquisition protocols, scanner types, and patient demographics, which can degrade the performance of traditional ML models. CL techniques can enable medical imaging systems to adapt to these variations by learning sequentially from new datasets without forgetting previously acquired knowledge [219]. A pivotal challenge in deep learning is integrating data from sources that differ in hardware vendor, acquisition protocol, experimental setup, and even operator [220]. The resulting heterogeneous datasets require careful harmonization before they can be used to train AI algorithms. Deep learning techniques can enhance diagnostic accuracy, streamline workflows, reduce interpretation time, and ultimately improve patient outcomes [221]. When integrated with NLP and computer vision, deep learning algorithms can also support multimodal medical data analysis and clinical decision support systems, leading to improvements in patient care [221]. CL can further play a crucial role in personalized medicine by enabling models to adapt to individual patient characteristics and treatment responses over time, which keeps AI models relevant and effective as new data become available [221,222].
Moreover, the static nature of conventional AI models poses a significant challenge in dynamic medical environments where conditions and protocols are ever-evolving [223]. CL enables models to adapt seamlessly to new data distributions, accommodating the introduction of novel diseases, updated diagnostic criteria, and evolving treatment modalities, thereby bolstering the long-term reliability and effectiveness of AI-driven medical solutions [221]. Deep learning algorithms, trained on extensive datasets, possess the capability to recognize intricate patterns and features that may elude the human eye, offering new insights to enhance decision-making. CL facilitates the integration of new imaging modalities or biomarkers into existing diagnostic workflows, enhancing the comprehensiveness and accuracy of clinical decision-making [224]. Despite the documented efficacy of ML and deep learning models in improving the accuracy of breast cancer diagnostics, challenges persist regarding the generalizability and robustness of these models across diverse medical imaging modalities [225]. CL addresses the critical need for models to adapt to new clinical guidelines and emerging medical knowledge, ensuring that diagnostic tools remain current and aligned with best practices.

Applications in Robotics and Autonomous Systems

Robotics and autonomous systems represent another promising area for CL, which can enable robots to adapt to changing environments, learn new skills, and improve their performance over time. Robots operating in real-world environments encounter dynamic surroundings, unexpected obstacles, and the need to interact with humans in a safe and intuitive manner. CL algorithms can help robots meet these challenges by continuously learning from experience and adapting their behavior accordingly. For instance, a robot deployed in a warehouse can learn to navigate new layouts, identify new objects, and optimize its path planning strategies through CL. In the context of autonomous driving, CL can be used to improve the perception, decision-making, and control capabilities of self-driving cars [226]. CL can also enable robots to learn new skills and tasks through imitation learning or RL, and robots can learn from human demonstrations or feedback, allowing them to acquire new skills more efficiently [227]. A central obstacle is catastrophic forgetting, where robots lose previously learned skills when learning new ones [12]. One way to mitigate it is to leverage the locality of splines: an update affects only the coefficients near the new data, so coefficients in far-away regions keep the information that needs to be preserved. Finally, CL eases the challenge of acquiring vast datasets for robotic learning by enabling robots to learn incrementally from limited data, thereby reducing the burden of data collection and annotation [227].

Applications in Natural Language Processing

NLP stands to gain significantly from CL, particularly in scenarios involving evolving language patterns, emerging topics, and the need to adapt to diverse user preferences. In domains such as social media monitoring and customer service, the language used by individuals is constantly evolving, necessitating models that can adapt to new words, phrases, and sentiments without forgetting previously learned information [12]. CL enables NLP models to stay current with the latest trends and maintain their accuracy over time. For instance, CL can be used to train sentiment analysis models that can accurately classify the sentiment of tweets or customer reviews, even as new slang terms and emojis emerge. As we progress toward an agentic world where LLM-based agents autonomously handle specialized tasks, it becomes crucial for these models to adapt to new tasks without forgetting previously learned information [228].
Moreover, CL can be applied to improve the personalization of NLP models, allowing them to adapt to the specific language preferences and communication styles of individual users. This can enhance the user experience in applications such as chatbots, virtual assistants, and personalized content recommendation systems. CL facilitates the dynamic adaptation of language models to individual learners, providing context-aware and precise responses that cater to their unique needs [229]. CL enables NLP models to effectively address the challenges posed by evolving language patterns, emerging topics, and diverse user preferences. Various CL scenarios have been explored in NLP, including DIL, TIL, CIL, Online CL, and Continual Pre-Training [46]. Numerous methods have been adapted to these scenarios and have demonstrated effectiveness, such as:
  • Weight Regularization: RMR-DSE [230], SRC [231].
  • KD: ExtendNER [232], CFID [233], CID [234], PAGeR [235], LFPT5 [236], DnR [237], CL-NMT [238], COKD [239].
  • Experience Replay: CFID [233], CID [234], ELLE [240], IDBR [241], MBPA++ [242], MetaMBPA++ [243], EMAR [244], DnR [237], ARPER [245], Total Recall [246].
  • Generative Replay: PAGeR [235], LAMOL [247], ACM [248], NER [249].
  • Parameter Allocation: TPEM [250].
  • Modular Networks: ProgModel [251].
  • Meta-Learning: MetaMBPA++ [243], MeLL [252], CML [253].
CL in NLP is also characterized by the widespread use of pre-trained transformer architectures, leading to the development of parameter-efficient fine-tuning techniques. These techniques enable transformers to adapt to new tasks by learning a small number of task-specific parameters. Examples include:
  • Adapter-Tuning: Inserting fully connected layers (CPT [46], CLIF [107], AdapterCL [254], ACM [248], ADA [255]).
  • Prompt-Tuning: Using trainable prompt tokens (C-PT [256], LFPT5 [236], EMP [257]).
  • Instruction-Based Approaches: Adding short descriptive text for each task (PAGeR [235], ConTinTin [258], ENTAILMENT [259]).
Given the success of pre-trained foundation models, these techniques are increasingly being applied to CL in visual domains [214,215,216,217,255,260,261]. Applications of CL in NLP span diverse tasks, creating unique opportunities for further research. Key areas include:
  • Dialogue Systems: [234,245,250,254,262,263].
  • Text Classification: [241,242,259,264].
  • Sentence Generation: [230,245,248].
  • Relation Learning: [244,253,265,266].
  • Neural Machine Translation: [238,239,267,268].
  • Named Entity Recognition: [232,236,249].
Additionally, some studies address the integration of vision and language, focusing on continual pre-training [108,269] or downstream tasks [108,270,271].

Recommender Systems

Recommender systems, which are ubiquitous in e-commerce, entertainment, and other online platforms, can also benefit significantly from CL [272]. Recommender systems use ML algorithms to predict the items or content that a user is most likely to be interested in, based on their past behavior and preferences. Recommender systems traditionally capture user interests by encoding their historical activities on the platforms [273]. However, as user preferences and interests evolve over time, recommender systems need to adapt to these changes in order to maintain their accuracy and relevance.
CL can be used to update recommender systems with new data and user feedback without forgetting previously learned preferences [274]. In doing so, recommender systems can personalize recommendations, adapt to new items and evolving trends, and mitigate popularity bias. For example, a movie recommender system can use CL to incorporate new movie releases and user ratings, ensuring that its recommendations remain up to date and aligned with current trends. In education, adaptive e-learning scenarios powered by CL not only keep learners engaged but also broaden their awareness of relevant courses [275], and personalized services that cater to learner preferences can enhance the learning experience [276].

Cybersecurity

The field of cybersecurity faces a continuous stream of novel threats and attack vectors, requiring security systems to adapt constantly in order to stay ahead of malicious actors. CL offers a promising approach by enabling security systems to learn from new attack patterns and vulnerabilities without forgetting previously learned knowledge. For instance, CL can be used to train intrusion detection systems that identify new types of malware and network attacks, even if they differ significantly from previously seen threats. Furthermore, CL can be applied to improve the accuracy and efficiency of spam filters, fraud detection systems, and other security applications. On the architectural side, Kolmogorov–Arnold Networks [277] can avoid catastrophic forgetting by leveraging the locality of splines: their B-splines are piecewise polynomial functions that offer local control, so an update affects only nearby coefficients. This property matters in cybersecurity because new threats emerge constantly. CL algorithms, such as Knowledge-augmented neural networks, likewise demonstrate the potential to mitigate catastrophic forgetting in neural networks.
The ability to learn continuously is critical for coping with evolving adaptation spaces [219]. By continually updating models with new data and feedback, security systems can improve their ability to detect and prevent cyberattacks, protecting sensitive data and infrastructure. Cybersecurity systems that use CL can dynamically adjust their defense mechanisms in response to emerging threats, offering a more robust and adaptive security posture [278]. Traditional security techniques often struggle to adapt to new threats [279]. Intrusion detection systems can use ML and deep learning to proactively prevent persistent and complex external attacks [280].
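The local-control property of B-splines mentioned in this subsection can be illustrated with a small sketch. It uses degree-1 B-splines (hat functions) on a uniform grid as a simplifying assumption, whereas spline-based networks typically use higher-degree bases. The function names and numeric values are hypothetical.

```python
from typing import List


def hat(x: float, center: float, width: float) -> float:
    """Degree-1 B-spline basis: nonzero only on (center - width, center + width)."""
    return max(0.0, 1.0 - abs(x - center) / width)


def spline_eval(x: float, coeffs: List[float], knots: List[float]) -> float:
    """Evaluate a spline given one coefficient per uniformly spaced knot."""
    width = knots[1] - knots[0]
    return sum(c * hat(x, k, width) for c, k in zip(coeffs, knots))


knots = [0.0, 1.0, 2.0, 3.0, 4.0]
coeffs = [0.5, 1.0, -0.5, 2.0, 0.0]

# Simulate learning from new data near x = 0 by changing only the
# coefficient attached to the first knot.
updated = list(coeffs)
updated[0] = 5.0

near_before = spline_eval(0.25, coeffs, knots)
near_after = spline_eval(0.25, updated, knots)
far_before = spline_eval(3.5, coeffs, knots)
far_after = spline_eval(3.5, updated, knots)
print(near_before, near_after, far_before, far_after)
```

The function changes near the updated knot and is exactly unchanged for inputs outside that basis function's support. This is the mechanism by which spline locality can limit interference with knowledge stored in other regions of the input space.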
Standard deep-learning methods lose their ability to learn with extended training on new data, a phenomenon known as loss of plasticity [281]. ML has the potential to significantly improve the speed and accuracy of threat detection, making it a powerful tool in the fight against cybercrime [282]. As commercial and open-source software developers improve the security of their products and organizations implement sophisticated threat detection systems, attackers are expected to use increasingly sophisticated methods to infiltrate networks [283,284]. ML-based systems have been shown to outperform traditional, human-based security monitoring systems, especially with the increasing demand for security [285].
In summary, CL enables systems in healthcare, robotics, NLP, recommender systems, and cybersecurity to adapt to new information, evolving environments, and user needs. By overcoming the limitations of traditional static models, CL enhances the efficiency, relevance, and resilience of intelligent systems, paving the way for their integration into real-world applications.

10. Open Challenges and Future Directions

Despite substantial progress, CL remains far from a mature solution for real-world adaptive intelligence. Many existing methods perform well under controlled benchmark settings but struggle with long task sequences, severe domain shifts, limited memory, privacy restrictions, unclear task boundaries, and large-scale deployment constraints. Future CL research should therefore move beyond simplified academic benchmarks and focus on scalable, reproducible, and deployment-oriented learning systems.

Catastrophic Forgetting and Long-Term Knowledge Retention

Catastrophic forgetting remains the central challenge in CL. It occurs when learning new information overwrites knowledge acquired from previous tasks. Although replay, regularization, knowledge distillation, and parameter-isolation methods have reduced forgetting in many settings, their effectiveness often decreases as task sequences become longer and more heterogeneous. Future work should focus on long-term retention under realistic task streams, where task boundaries are unclear, data distributions evolve gradually, and previous data may be unavailable due to privacy or storage constraints.

Scalability and Realistic Streaming Benchmarks

Many CL studies still rely on artificial task splits created from static datasets. While useful for controlled comparison, these settings do not fully represent real-world data streams, where new classes, domains, and concepts may emerge continuously. Future benchmarks should include realistic streaming conditions, temporal distribution shifts, noisy labels, open-set categories, class imbalance, and limited annotation availability. Evaluation protocols should clearly report task order, memory budget, validation strategy, computational cost, and variance across multiple runs. Without such standardized reporting, comparisons across CL methods remain unreliable.
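Standardized reporting is easier when every method is summarized from the same accuracy matrix. The sketch below computes three commonly used quantities (final average accuracy, average forgetting, and backward transfer) from a matrix whose entry acc[i][j] is the accuracy on task j after training on task i. The function names and the numbers in the example are hypothetical and serve only to illustrate the calculation.

```python
from typing import List

Matrix = List[List[float]]


def average_accuracy(acc: Matrix) -> float:
    """Mean accuracy over all tasks after training on the final task."""
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc: Matrix) -> float:
    """Mean drop from the best earlier accuracy on each old task (needs >= 2 tasks)."""
    n_tasks = len(acc)
    drops = []
    for j in range(n_tasks - 1):
        best_earlier = max(acc[i][j] for i in range(j, n_tasks - 1))
        drops.append(best_earlier - acc[-1][j])
    return sum(drops) / len(drops)


def backward_transfer(acc: Matrix) -> float:
    """Mean change on each old task relative to when it was just learned."""
    n_tasks = len(acc)
    return sum(acc[-1][j] - acc[j][j] for j in range(n_tasks - 1)) / (n_tasks - 1)


# Hypothetical three-task run; entries above the diagonal are unused here.
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.75, 0.90],
]
print(round(average_accuracy(acc), 4))
print(round(average_forgetting(acc), 4))
print(round(backward_transfer(acc), 4))
```

Reporting these values together with the task order, memory budget, and spread across several seeds would make results from different studies easier to compare.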

Memory, Computation, and Deployment Constraints

Practical CL systems must balance knowledge retention with memory and computational efficiency. Replay-based methods often provide strong performance, but storing past samples can be infeasible in privacy-sensitive or resource-limited environments. Architecture-based methods reduce interference but may suffer from uncontrolled parameter growth. Regularization-based methods are memory-efficient but can struggle under severe distribution shifts. Future work should prioritize lightweight CL methods that support efficient training, low-latency inference, energy-aware deployment, and stable performance on edge devices, robotics platforms, wearable systems, and embedded AI applications.

CL for Foundation Models

Foundation models, including large language models, vision-language models, and large-scale vision transformers, have changed the landscape of modern AI. However, most CL research still focuses on smaller models and controlled datasets. Applying CL to foundation models introduces new challenges because full fine-tuning is computationally expensive and may damage pretrained general knowledge, while freezing the model limits adaptation. Future research should investigate how foundation models can continuously acquire new knowledge while preserving their broad generalization ability, reasoning capacity, and cross-task transfer performance.

Parameter-Efficient Continual Adaptation

Parameter-efficient fine-tuning (PEFT) methods, including adapters, LoRA, prefix tuning, and prompt-based learning, offer a promising direction for scalable CL. These methods update only a small subset of parameters and can reduce interference with pretrained knowledge. However, their long-term behavior in continual settings remains insufficiently understood. Important questions include how to select, expand, merge, or prune task-specific prompts and adapters over long task sequences. Future studies should also examine whether PEFT-based CL remains stable when task identity is unavailable during inference or when tasks overlap substantially.
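When task identity is unavailable at inference, prompt-pool methods in the style of L2P select prompts by matching a query feature of the input against learnable keys. The sketch below shows only this selection step, assuming that the query feature has already been produced by a frozen encoder. The function names, key values, and pool size are hypothetical.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, defined as 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def select_prompts(query: List[float], keys: List[List[float]], top_n: int) -> List[int]:
    """Return indices of the top_n prompts whose keys best match the query."""
    scored = [(cosine(query, key), index) for index, key in enumerate(keys)]
    # Sort by descending similarity; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_n]]


# Hypothetical pool of four prompt keys in a 2-D feature space.
prompt_keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_feature = [0.9, 0.1]
selected = select_prompts(query_feature, prompt_keys, top_n=2)
print(selected)
```

Because selection depends only on key similarity, overlapping tasks may compete for the same prompts. This is one concrete form of the prompt interference and selection problems raised above.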

Multimodal CL

Most existing CL studies focus on single-modality data, especially image classification. However, real-world intelligent systems often process multiple modalities, including images, text, audio, video, sensor signals, and clinical records. Multimodal CL introduces additional complexity because forgetting may occur within individual modalities, across modalities, or in the alignment between modalities. Future research should investigate how to preserve cross-modal representations while allowing each modality to adapt to new distributions. This direction is particularly important for vision-language models, embodied AI, medical AI, and human-centered intelligent systems.

Privacy-Preserving and Federated CL

In many practical applications, previous data cannot be stored or replayed because of privacy, legal, or institutional restrictions. This is especially important in healthcare, finance, mobile devices, and distributed edge systems. Federated CL offers one possible solution by allowing models to learn across distributed clients without centralizing raw data. However, it introduces additional challenges, including client drift, heterogeneous data distributions, communication cost, and forgetting across clients. Future work should develop privacy-preserving CL methods that jointly address knowledge retention, data protection, fairness, and communication efficiency.
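The aggregation step that lets clients collaborate without sharing raw data can be sketched with federated averaging (FedAvg), a generic federated learning rule used here for illustration rather than a federated CL method in its own right. Each client sends only its parameter vector and local sample count. The function name and the numbers are hypothetical.

```python
from typing import List


def fedavg(client_weights: List[List[float]], client_sizes: List[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients: the second holds three times as much data.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]
global_weights = fedavg(client_weights, client_sizes)
print(global_weights)
```

In a federated CL system, each client would additionally need its own mechanism against forgetting. Averaging alone does not address client drift or heterogeneous task sequences.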

CL in Medical Imaging Under Domain Shift

Medical imaging is a critical but difficult application area for CL. Models deployed in clinical environments must adapt to new scanners, imaging protocols, patient populations, annotation styles, and disease categories. However, medical CL is constrained by limited annotations, strict privacy regulations, class imbalance, and high safety requirements. Future studies should develop domain-aware CL protocols for segmentation, detection, diagnosis, and prognosis tasks. More attention is also needed on clinically meaningful evaluation, including cross-institution robustness, uncertainty estimation, failure analysis, and retention of performance on rare but clinically important cases.

Ethical, Fairness, and Safety Considerations

As CL systems continuously adapt, they may amplify biases, become harder to audit, or develop unintended behavior after deployment. These risks are especially serious in sensitive domains such as healthcare, finance, autonomous driving, and public services. Future CL systems should include mechanisms for bias monitoring, privacy protection, interpretability, uncertainty estimation, and human oversight. Responsible CL should not only optimize accuracy and forgetting metrics but also ensure fairness, transparency, robustness, and safety during long-term adaptation.

Toward Robust Lifelong AI

The long-term goal of CL is to support lifelong AI systems that can acquire, organize, and refine knowledge over extended periods. Achieving this goal requires methods that integrate stable memory, flexible adaptation, efficient resource use, and reliable decision-making. Future research should combine insights from machine learning, neuroscience, cognitive science, ethics, and human-centered AI. Progress will depend not only on stronger algorithms but also on better benchmarks, clearer evaluation standards, and more realistic deployment studies.

Summary

Future progress in CL will depend on shifting from simplified benchmark performance toward robust lifelong adaptation. The most important directions include long-term knowledge retention, realistic streaming evaluation, foundation-model-based CL, parameter-efficient adaptation, multimodal learning, privacy-preserving methods, and medical CL under domain shift. Addressing these challenges is essential for developing CL systems that are accurate, scalable, safe, and reliable in real-world environments.
Figure 7. CIL employs several key strategies to alleviate catastrophic forgetting by targeting different representation levels. In the data space, experience replay is commonly used to retain and revisit past samples. In both the feature space and the label space, KD serves as an effective technique by preserving the informative patterns and output behavior of previous models. Together, these approaches help maintain prior knowledge while integrating new classes over time.
Table 15. Comparative view of the challenges in CL, their impacts, examples, and potential solutions.
| Challenge | Description | Impact | Examples | Potential Solutions |
|---|---|---|---|---|
| Catastrophic Forgetting | Overwriting of previous knowledge when learning new tasks. | Loss of performance on earlier tasks, limiting multi-task applications. | A model trained on new object classes forgets previously learned ones. | Replay methods, regularization techniques (e.g., EWC, SI), parameter isolation (e.g., PNNs, PackNet). |
| Scalability to Real-World Tasks | Difficulty in handling diverse, undefined, and open-ended tasks found in real-world environments. | Limits practical applications, especially in dynamic or multi-domain environments. | A robot operating in a dynamic home environment fails to generalize across diverse tasks. | Dynamic architectures (e.g., expandable networks), meta-learning, unsupervised task detection. |
| Memory Constraints | Storing data from previous tasks is often infeasible for large-scale or resource-limited applications. | Limits model ability to effectively retain and replay past information. | Replay-based methods requiring storage of vast datasets for continual adaptation. | Efficient memory management techniques, synthetic replay using generative models, data pruning. |
| Computational Overhead | Increased computational demands for training and inference due to replay, regularization, or parameter isolation techniques. | Hinders real-time applications on edge devices or systems with limited resources. | On-device CL in IoT systems is slowed by high computational requirements. | Lightweight models, parameter optimization, pruning, and efficient task-specific parameter allocation. |
| Bias Amplification | Sequential learning may reinforce biases present in earlier data or tasks. | Skewed model behavior, disproportionately affecting certain demographic groups. | A financial model favoring certain demographics due to biased historical data. | Fairness-aware training, regular bias audits, diversity-focused data augmentation. |
| Transparency and Explainability | Models evolving continuously can become opaque, making their decision-making hard to interpret. | Erodes trust, particularly in sensitive applications like healthcare or finance. | Difficulty auditing a continually adapting medical diagnostic system. | Explainability frameworks, interpretable architecture designs, and model debugging tools. |
| Privacy Concerns | Replay-based methods storing or processing user data may violate privacy regulations. | Non-compliance with privacy laws (e.g., GDPR, HIPAA), leading to legal and ethical implications. | Retaining user data for replay in recommendation systems could breach user consent. | Privacy-preserving methods like federated learning, data anonymization, and synthetic data generation. |
| Unintended Consequences | Autonomous learning systems may exhibit behaviors or decisions not aligned with human intentions or societal norms. | Potential safety risks, ethical conflicts, or misaligned system behavior in real-world scenarios. | A self-learning robot adopts unsafe behaviors while optimizing a task autonomously. | Strict behavioral constraints, ethical guidelines for autonomous systems, and robust oversight mechanisms during model deployment. |

11. Conclusion

Continual learning (CL), also referred to as lifelong learning, aims to develop intelligent systems capable of learning continuously from sequential data while retaining previously acquired knowledge. As AI systems are increasingly deployed in dynamic real-world environments, the ability to adapt without catastrophic forgetting has become a critical requirement across domains such as computer vision, natural language processing, robotics, healthcare, autonomous systems, and multimodal AI. This review provided a structured overview of major CL paradigms, including task-incremental, domain-incremental, class-incremental, online, multimodal, and federated learning settings. We discussed the theoretical foundations of CL, particularly the stability-plasticity dilemma, catastrophic forgetting, transfer learning dynamics, and representation learning. Furthermore, we analyzed major methodological categories, including regularization-based, replay-based, architecture-based, optimization-based, representation-learning, and parameter-efficient approaches. Recent developments involving transformers, prompt learning, foundation models, and multimodal adaptation were also discussed as emerging directions in modern CL research. In addition, this review highlighted the importance of evaluation protocols, benchmark design, memory constraints, computational efficiency, and reproducibility for fair comparison across CL methods. Existing studies often rely on inconsistent task splits, memory budgets, and evaluation settings, making direct comparison difficult. The review therefore emphasized the need for standardized benchmarks and more realistic streaming evaluation protocols that better reflect real-world deployment conditions. Despite substantial progress, several open challenges remain unresolved.
Current CL systems still struggle with long-term knowledge retention, severe domain shifts, scalability under large task sequences, privacy constraints, and efficient adaptation of large foundation models. Emerging research directions such as parameter-efficient continual adaptation, multimodal CL, federated CL, and medical CL under domain shift are expected to play an increasingly important role in future research. Overall, CL remains a rapidly evolving research field with significant potential for enabling adaptive and robust AI systems. Future progress will depend not only on stronger algorithms but also on realistic evaluation practices, scalable deployment strategies, efficient adaptation mechanisms, and reliable long-term learning under dynamic environments.

Author Contributions

Conceptualization, Z.U. and J.K.; Methodology, Z.U.; Formal analysis, Z.U. and J.K.; Investigation, Z.U. and J.K.; Writing—original draft preparation, Z.U.; Writing—review and editing, Z.U.; visualization, M.H.; supervision, J.K.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2025-RS-2020-II201789), and the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2025-RS-2023-00254592) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Azizi, A.; Zhang, Z.; Hua, W.; Li, M.; Igathinathane, C.; Yang, L.; Ampatzidis, Y.; Ghasemi-Varnamkhasti, M.; Zhang, M.; Li, H.; et al. Image processing and artificial intelligence for apple detection and localization: A comprehensive review. Comput. Sci. Rev. 2024, 54, 100690. [Google Scholar] [CrossRef]
  2. Annan, R.; Qingge, L. Artificial intelligence in COVID-19 research: A comprehensive survey of innovations, challenges, and future directions. Comput. Sci. Rev. 2025, 57, 100751. [Google Scholar] [CrossRef]
  3. Herrera, F. Reflections and attentiveness on eXplainable Artificial Intelligence (XAI). The journey ahead from criticisms to human–AI collaboration. Inf. Fusion 2025, 121, 103133. [Google Scholar]
  4. Utomo, S.; Pratap, A.; Karthikeyan, P.; Ayeelyan, J.; Hsu, H.C.; Hsiung, P.A. When explainable artificial intelligence meets data governance: Enhancing trustworthiness in multimodal gas classification. Inf. Fusion 2025, 103440. [Google Scholar] [CrossRef]
  5. Górriz, J.M.; Álvarez-Illán, I.; Álvarez-Marquina, A.; Arco, J.E.; Atzmueller, M.; Ballarini, F.; Barakova, E.; Bologna, G.; Bonomini, P.; Castellanos-Dominguez, G.; et al. Computational approaches to explainable artificial intelligence: advances in theory, applications and trends. Inf. Fusion 2023, 100, 101945. [Google Scholar] [CrossRef]
  6. Longo, L.; Brcic, M.; Cabitza, F.; Choi, J.; Confalonieri, R.; Del Ser, J.; Guidotti, R.; Hayashi, Y.; Herrera, F.; Holzinger, A.; et al. Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Inf. Fusion 2024, 106, 102301. [Google Scholar] [CrossRef]
  7. Rezaee, K. Machine learning in automated diagnosis of autism spectrum disorder: A comprehensive review. Comput. Sci. Rev. 2025, 56, 100730. [Google Scholar] [CrossRef]
  8. Naser, M. From failure to fusion: A survey on learning from bad machine learning models. Inf. Fusion 2025, 120, 103122. [Google Scholar] [CrossRef]
  9. Escovedo, T.; Koshiyama, A.; da Cruz, A.A.; Vellasco, M. Neuroevolutionary learning in nonstationary environments. Appl. Intell. 2020, 50, 1590–1608. [Google Scholar] [CrossRef]
  10. Criado, M.F.; Casado, F.E.; Iglesias, R.; Regueiro, C.V.; Barro, S. Non-iid data and continual learning processes in federated learning: A long road ahead. Inf. Fusion 2022, 88, 263–280. [Google Scholar] [CrossRef]
  11. Nguyen, C.V.; Achille, A.; Lam, M.; Hassner, T.; Mahadevan, V.; Soatto, S. Toward understanding catastrophic forgetting in continual learning. arXiv 2019, arXiv:1908.01091. [Google Scholar] [CrossRef]
  12. ParisiGerman, I.; PartJose, L.; et al. Continual lifelong learning with neural networks. 2019. [Google Scholar] [CrossRef]
  13. Wang, L.; Zhang, X.; Su, H.; Zhu, J. A comprehensive survey of continual learning: Theory, method and application. In IEEE Transactions on Pattern Analysis and Machine Intelligence; 2024. [Google Scholar]
  14. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  15. Xu, X.; Chen, J.; Thakur, D.; Hong, D. Multi-modal disease segmentation with continual learning and adaptive decision fusion. Inf. Fusion 2025, 102962. [Google Scholar]
  16. Wu, Y.; Li, Z.; Gao, Y.; Chiclana, F.; Chen, X.; Dong, Y. An endogenous and continual learning approach to personalize individual semantics to support linguistic consensus reaching. Inf. Fusion 2025, 114, 102640. [Google Scholar] [CrossRef]
  17. Yu, Y.; Du, Z.; Meng, L.; Li, J.; Hu, J. Adaptive online continual multi-view learning. Inf. Fusion 2024, 103, 102020. [Google Scholar] [CrossRef]
  18. Calvaresi, D.; Calbimonte, J.P. Real-time compliant stream processing agents for physical rehabilitation. Sensors 2020, 20, 746. [Google Scholar] [CrossRef]
  19. Shahrivari, S. Beyond batch processing: towards real-time and streaming big data. Computers 2014, 3, 117–129. [Google Scholar] [CrossRef]
  20. Parisi, G.I.; Lomonaco, V. Online continual learning on sequences. In Proceedings of the Recent Trends in Learning From Data: Tutorials from the INNS Big Data and Deep Learning Conference (INNSBDDL2019). Springer, 2020, pp. 197–221.
  21. Van de Ven, G.M.; Tuytelaars, T.; Tolias, A.S. Three types of incremental learning. Nat. Mach. Intell. 2022, 4, 1185–1197. [Google Scholar] [CrossRef]
  22. Bidaki, S.A.; Mohammadkhah, A.; Rezaee, K.; Hassani, F.; Eskandari, S.; Salahi, M.; Ghassemi, M.M. Online continual learning: A systematic literature review of approaches, challenges, and benchmarks. arXiv 2025, arXiv:2501.04897. [Google Scholar] [CrossRef]
  23. Zhou, D.W.; Wang, Q.W.; Qi, Z.H.; Ye, H.J.; Zhan, D.C.; Liu, Z. Class-incremental learning: A survey. In IEEE Transactions on Pattern Analysis and Machine Intelligence; 2024. [Google Scholar]
  24. Wickramasinghe, B.; Saha, G.; Roy, K. Continual learning: A review of techniques, challenges, and future directions. IEEE Trans. Artif. Intell. 2023, 5, 2526–2546. [Google Scholar] [CrossRef]
  25. Thrun, S.; Mitchell, T.M. Lifelong robot learning. Robot. Auton. Syst. 1995, 15, 25–46. [Google Scholar] [CrossRef]
  26. Tan, A.; Wang, Y.; Wu, W.Z.; Ding, W.; Liang, J. Multi-View Fusion Graph Attention Network for multilabel class incremental learning. Inf. Fusion 2025, 103309. [Google Scholar] [CrossRef]
  27. Li, D.; Wang, T.; Chen, J.; Kawaguchi, K.; Lian, C.; Zeng, Z. Multi-view class incremental learning. Inf. Fusion 2024, 102, 102021. [Google Scholar] [CrossRef]
  28. Zheng, Y.; Zhang, X.; Tian, Z.; Du, S. Enhancing few-shot lifelong learning through fusion of cross-domain knowledge. Inf. Fusion 2025, 115, 102730. [Google Scholar] [CrossRef]
  29. Mehta, S.V.; Patil, D.; Chandar, S.; Strubell, E. An empirical investigation of the role of pre-training in lifelong learning. J. Mach. Learn. Res. 2023, 24, 1–50. [Google Scholar]
  30. Kanakis, M.; Bruggemann, D.; Saha, S.; Georgoulis, S.; Obukhov, A.; Van Gool, L. Reparameterizing convolutions for incremental multi-task learning without task interference. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. Springer, 2020, pp. 689–707.
  31. Vödisch, N.; Cattaneo, D.; Burgard, W.; Valada, A. Covio: Online continual learning for visual-inertial odometry. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2464–2473.
  32. Ullah, Z.; Usman, M.; Gwak, J. MTSS-AAE: Multi-task semi-supervised adversarial autoencoding for COVID-19 detection based on chest X-ray images. Expert Syst. With Appl. 2023, 216, 119475. [Google Scholar] [CrossRef]
  33. Bonicelli, L.; Boschini, M.; Frascaroli, E.; Porrello, A.; Pennisi, M.; Bellitto, G.; Palazzo, S.; Spampinato, C.; Calderara, S. On the effectiveness of equivariant regularization for robust online continual learning. arXiv 2023, arXiv:2305.03648. [Google Scholar] [CrossRef]
  34. Ali, S.; Abuhmed, T.; El-Sappagh, S.; Muhammad, K.; Alonso-Moral, J.M.; Confalonieri, R.; Guidotti, R.; Del Ser, J.; Díaz-Rodríguez, N.; Herrera, F. Explainable Artificial Intelligence (XAI): What we know and what is left to attain Trustworthy Artificial Intelligence. Inf. Fusion 2023, 99, 101805. [Google Scholar] [CrossRef]
  35. Abbass, H. What is artificial intelligence? IEEE Trans. Artif. Intell. 2021, 2, 94–95. [Google Scholar] [CrossRef]
  36. Smith, P.D. Hands-On Artificial Intelligence for Beginners: An introduction to AI concepts, algorithms, and their implementation; Packt Publishing Ltd, 2018. [Google Scholar]
  37. Chen, Z.; Liu, B. Lifelong machine learning; Morgan & Claypool Publishers, 2018. [Google Scholar]
  38. Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71. [Google Scholar] [CrossRef] [PubMed]
  39. Hayes, T.L.; Krishnan, G.P.; Bazhenov, M.; Siegelmann, H.T.; Sejnowski, T.J.; Kanan, C. Replay in deep learning: Current approaches and missing biological elements. Neural Comput. 2021, 33, 2908–2950. [Google Scholar] [CrossRef] [PubMed]
  40. Kudithipudi, D.; Aguilar-Simon, M.; Babb, J.; Bazhenov, M.; Blackiston, D.; Bongard, J.; Brna, A.P.; Chakravarthi Raja, S.; Cheney, N.; Clune, J.; et al. Biological underpinnings for lifelong learning machines. Nat. Mach. Intell. 2022, 4, 196–210. [Google Scholar] [CrossRef]
  41. Hadsell, R.; Rao, D.; Rusu, A.A.; Pascanu, R. Embracing change: Continual learning in deep neural networks. Trends Cogn. Sci. 2020, 24, 1028–1040. [Google Scholar] [CrossRef] [PubMed]
  42. Qu, H.; Rahmani, H.; Xu, L.; Williams, B.; Liu, J. Recent advances of continual learning in computer vision: An overview. IET Comput. Vis. 2025, 19, e70013. [Google Scholar] [CrossRef]
  43. Mai, Z.; Li, R.; Jeong, J.; Quispe, D.; Kim, H.; Sanner, S. Online continual learning in image classification: An empirical survey. Neurocomputing 2022, 469, 28–51. [Google Scholar] [CrossRef]
  44. Masana, M.; Twardowski, B.; Van de Weijer, J. On class orderings for incremental learning. arXiv 2020, arXiv:2007.02145. [Google Scholar] [CrossRef]
  45. Biesialska, M.; Biesialska, K.; Costa-Jussa, M.R. Continual lifelong learning in natural language processing: A survey. arXiv 2020, arXiv:2012.09823. [Google Scholar] [CrossRef]
  46. Ke, Z.; Liu, B. Continual learning of natural language processing tasks: A survey. arXiv 2022, arXiv:2211.12701. [Google Scholar]
  47. Khetarpal, K.; Riemer, M.; Rish, I.; Precup, D. Towards continual reinforcement learning: A review and perspectives. J. Artif. Intell. Res. 2022, 75, 1401–1476. [Google Scholar] [CrossRef]
  48. Ghosh, S. Dynamic vaes with generative replay for continual zero-shot learning. arXiv 2021, arXiv:2104.12468. [Google Scholar] [CrossRef]
  49. Singh, P.; Mazumder, P.; Rai, P.; Namboodiri, V.P. Rectification-based knowledge retention for continual learning. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15282–15291.
  50. Tao, X.; Hong, X.; Chang, X.; Dong, S.; Wei, X.; Gong, Y. Few-shot class-incremental learning. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020; pp. 12183–12192. [Google Scholar]
  51. Wang, L.; Yang, K.; Li, C.; Hong, L.; Li, Z.; Zhu, J. Ordisco: Effective and efficient usage of incremental unlabeled data for semi-supervised continual learning. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021; pp. 5383–5392. [Google Scholar]
  52. Joseph, K.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards open world object detection. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021; pp. 5830–5840. [Google Scholar]
  53. Wang, Q.F.; Geng, X.; Lin, S.X.; Xia, S.Y.; Qi, L.; Xu, N. Learngene: From open-world to your learning task. Proc. Proc. AAAI Conf. Artif. Intell. 2022, Vol. 36, 8557–8565. [Google Scholar]
  54. Hu, D.; Yan, S.; Lu, Q.; Hong, L.; Hu, H.; Zhang, Y.; Li, Z.; Wang, X.; Feng, J. How well does self-supervised pre-training perform with streaming data? arXiv 2021, arXiv:2104.12081. [Google Scholar]
  55. Rao, D.; Visin, F.; Rusu, A.; Pascanu, R.; Teh, Y.W.; Hadsell, R. Continual unsupervised representation learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  56. Ruvolo, P.; Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the International conference on machine learning. PMLR; 2013; pp. 507–515. [Google Scholar]
  57. Masse, N.Y.; Grant, G.D.; Freedman, D.J. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proc. Natl. Acad. Sci. 2018, 115, E10467–E10475. [Google Scholar] [CrossRef]
  58. Ramesh, R.; Chaudhari, P. Model zoo: A growing" brain" that learns continually. arXiv 2021, arXiv:2106.03027. [Google Scholar]
  59. PourKeshavarzi, M.; Zhao, G.; Sabokrou, M. Looking back on learned experiences for class/task incremental learning. In Proceedings of the International Conference on Learning Representations; 2021. [Google Scholar]
  60. Xie, X.; Xu, J.; Hu, P.; Zhang, W.; Huang, Y.; Zheng, W.; Wang, R. Task-incremental medical image classification with task-specific batch normalization. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV); 2023; Springer; pp. 309–320. [Google Scholar]
  61. Feng, F.; Chan, R.H.; Shi, X.; Zhang, Y.; She, Q. Challenges in task incremental learning for assistive robotics. IEEE Access 2019, 8, 3434–3441. [Google Scholar] [CrossRef]
  62. Lopez-Paz, D.; Ranzato, M. Gradient episodic memory for continual learning. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  63. Vogelstein, J.T.; Dey, J.; Helm, H.S.; LeVine, W.; Mehta, R.D.; Tomita, T.M.; Xu, H.; Geisa, A.; Wang, Q.; van de Ven, G.M.; et al. A Simple Lifelong Learning Approach. arXiv 2020, arXiv:2004.12908. [Google Scholar]
  64. Ke, Z.; Liu, B.; Xu, H.; Shu, L. CLASSIC: Continual and contrastive learning of aspect sentiment classification tasks. arXiv 2021, arXiv:2112.02714. [Google Scholar] [CrossRef]
  65. Mirza, M.J.; Masana, M.; Possegger, H.; Bischof, H. An efficient domain-incremental learning approach to drive in all weather conditions. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022; pp. 3001–3011. [Google Scholar]
  66. Aljundi, R.; Chakravarty, P.; Tuytelaars, T. Expert gate: Lifelong learning with a network of experts. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition; 2017; pp. 3366–3375. [Google Scholar]
  67. Von Oswald, J.; Henning, C.; Grewe, B.F.; Sacramento, J. Continual learning with hypernetworks. arXiv 2019, arXiv:1906.00695. [Google Scholar]
  68. Verma, V.K.; Liang, K.J.; Mehta, N.; Rai, P.; Carin, L. Efficient feature transformations for discriminative and generative continual learning. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021; pp. 13865–13875. [Google Scholar]
  69. Lomonaco, V.; Maltoni, D. Core50: a new dataset and benchmark for continuous object recognition. In Proceedings of the Conference on robot learning. PMLR; 2017; pp. 17–26. [Google Scholar]
  70. Garg, P.; Saluja, R.; Balasubramanian, V.N.; Arora, C.; Subramanian, A.; Jawahar, C. Multi-domain incremental learning for semantic segmentation. In Proceedings of the Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2022; pp. 761–771. [Google Scholar]
  71. Capuano, N.; Greco, L.; Ritrovato, P.; Vento, M. Sentiment analysis for customer relationship management: an incremental learning approach. Appl. Intell. 2021, 51, 3339–3352. [Google Scholar] [CrossRef]
  72. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C.H. icarl: Incremental classifier and representation learning. In Proceedings of the Proceedings of the IEEE conference on Computer Vision and Pattern Recognition; 2017; pp. 2001–2010. [Google Scholar]
  73. Shin, H.; Lee, J.K.; Kim, J.; Kim, J. Continual learning with deep generative replay. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  74. Van de Ven, G.M.; Siegelmann, H.T.; Tolias, A.S. Brain-inspired replay for continual learning with artificial neural networks. Nat. Commun. 2020, 11, 4069. [Google Scholar] [CrossRef]
  75. Zhou, D.W.; Yang, Y.; Zhan, D.C. Learning to classify with incremental new class. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 2429–2443. [Google Scholar] [CrossRef]
  76. Belouadah, E.; Popescu, A.; Kanellos, I. A comprehensive study of class incremental learning algorithms for visual tasks. Neural Netw. 2021, 135, 38–54. [Google Scholar] [CrossRef]
  77. Masana, M.; Liu, X.; Twardowski, B.; Menta, M.; Bagdanov, A.D.; Van De Weijer, J. Class-incremental learning: survey and performance evaluation on image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5513–5533. [Google Scholar] [CrossRef]
  78. Channappayya, S.; Tamma, B.R.; et al. Augmented memory replay-based continual learning approaches for network intrusion detection. Adv. Neural Inf. Process. Syst. 2023, 36, 17156–17169. [Google Scholar]
  79. Li, X.; Wang, S.; Sun, J.; Xu, Z. Variational data-free knowledge distillation for continual learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12618–12634. [Google Scholar] [CrossRef]
  80. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
  81. Park, J.; Kang, M.; Han, B. Class-incremental learning for action recognition in videos. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision; 2021; pp. 13698–13707. [Google Scholar]
  82. Villa, A.; Alhamoud, K.; Escorcia, V.; Caba, F.; Alcázar, J.L.; Ghanem, B. vclimb: A novel video class incremental learning benchmark. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 19035–19044. [Google Scholar]
  83. Shmelkov, K.; Schmid, C.; Alahari, K. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the Proceedings of the IEEE international conference on computer vision; 2017; pp. 3400–3409. [Google Scholar]
  84. Girshick, R. Fast r-cnn. In Proceedings of the Proceedings of the IEEE international conference on computer vision; 2015; pp. 1440–1448. [Google Scholar]
  85. Ramakrishnan, K.; Panda, R.; Fan, Q.; Henning, J.; Oliva, A.; Feris, R. Relationship matters: Relation guided knowledge transfer for incremental learning of object detectors. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops; 2020; pp. 250–251. [Google Scholar]
  86. Paik, I.; Oh, S.; Kwak, T.; Kim, I. Overcoming catastrophic forgetting by neuron-level plasticity control. Proc. Proc. AAAI Conf. Artif. Intell. 2020, Vol. 34, 5339–5346. [Google Scholar] [CrossRef]
  87. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  88. Li, D.; Tasci, S.; Ghosh, S.; Zhu, J.; Zhang, J.; Heck, L. RILOD: Near real-time incremental learning for object detection at the edge. In Proceedings of the Proceedings of the 4th ACM/IEEE Symposium on Edge Computing; 2019; pp. 113–126. [Google Scholar]
  89. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the Proceedings of the IEEE international conference on computer vision; 2017; pp. 2980–2988. [Google Scholar]
  90. Feng, T.; Wang, M.; Yuan, H. Overcoming catastrophic forgetting in incremental object detection via elastic response distillation. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 9427–9436. [Google Scholar]
  91. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
  92. Hao, Y.; Fu, Y.; Jiang, Y.G.; Tian, Q. An end-to-end architecture for class-incremental object detection with knowledge distillation. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME); IEEE, 2019; pp. 1–6. [Google Scholar]
  93. Peng, C.; Zhao, K.; Lovell, B.C. Faster ilod: Incremental learning for object detectors based on faster rcnn. Pattern Recognit. Lett. 2020, 140, 109–115. [Google Scholar] [CrossRef]
  94. Zhang, J.; Zhang, J.; Ghosh, S.; Li, D.; Tasci, S.; Heck, L.; Zhang, H.; Kuo, C.C.J. Class-incremental learning via deep model consolidation. In Proceedings of the Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2020; pp. 1131–1140. [Google Scholar]
  95. Dong, N.; Zhang, Y.; Ding, M.; Lee, G.H. Bridging non co-occurrence with unlabeled in-the-wild data for incremental object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 30492–30503. [Google Scholar]
  96. Joseph, K.; Rajasegaran, J.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Incremental object detection via meta-learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9209–9216. [Google Scholar] [CrossRef]
  97. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  98. Zhao, N.; Lee, G.H. Static-dynamic co-teaching for class-incremental 3d object detection. Proc. Proc. AAAI Conf. Artif. Intell. 2022, Vol. 36, 3436–3445. [Google Scholar] [CrossRef]
  99. Wang, J.; Wang, X.; Shang-Guan, Y.; Gupta, A. Wanderlust: Online continual object detection in the real world. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision; 2021; pp. 10829–10838. [Google Scholar]
  100. Perez-Rua, J.M.; Zhu, X.; Hospedales, T.M.; Xiang, T. Incremental few-shot object detection. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020; pp. 13846–13855. [Google Scholar]
  101. Feng, J.; Phillips, R.V.; Malenica, I.; Bishara, A.; Hubbard, A.E.; Celi, L.A.; Pirracchio, R. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digit. Med. 2022, 5, 66. [Google Scholar] [CrossRef]
  102. Chrisley, R. Embodied artificial intelligence. Artif. Intell. 2003, 149, 131–150. [Google Scholar] [CrossRef]
  103. Duan, J.; Yu, S.; Tan, H.L.; Zhu, H.; Tan, C. A survey of embodied ai: From simulators to research tasks. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 230–244. [Google Scholar] [CrossRef]
  104. Franklin, S. Autonomous agents as embodied AI. Cybern. Syst. 1997, 28, 499–520. [Google Scholar] [CrossRef]
  105. Shi, G.; Wu, Y.; Liu, J.; Wan, S.; Wang, W.; Lu, T. Incremental few-shot semantic segmentation via embedding adaptive-update and hyper-class representation. In Proceedings of the Proceedings of the 30th ACM international conference on multimedia; 2022; pp. 5547–5556. [Google Scholar]
  106. Ganea, D.A.; Boom, B.; Poppe, R. Incremental few-shot instance segmentation. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021; pp. 1185–1194. [Google Scholar]
  107. Jin, X.; Lin, B.Y.; Rostami, M.; Ren, X. Learn continually, generalize rapidly: Lifelong knowledge accumulation for few-shot learning. arXiv 2021, arXiv:2104.08808. [Google Scholar]
  108. Cossu, A.; Carta, A.; Passaro, L.; Lomonaco, V.; Tuytelaars, T.; Bacciu, D. Continual pre-training mitigates forgetting in language and vision. Neural Netw. 2024, 179, 106492. [Google Scholar] [CrossRef]
  109. Madaan, D.; Yoon, J.; Li, Y.; Liu, Y.; Hwang, S.J. Representational continuity for unsupervised continual learning. arXiv 2021, arXiv:2110.06976. [Google Scholar]
  110. Riemer, M.; Cases, I.; Ajemian, R.; Liu, M.; Rish, I.; Tu, Y.; Tesauro, G. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv 2018, arXiv:1810.11910. [Google Scholar]
  111. Guo, Q.; Zhao, W.; Lyu, Z.; Zhao, T. A GAN enhanced meta-deep reinforcement learning approach for DCN routing optimization. Inf. Fusion 2025, 121, 103160. [Google Scholar] [CrossRef]
  112. Zhao, Y.; Zhong, Z.; Yang, F.; Luo, Z.; Lin, Y.; Li, S.; Sebe, N. Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021; pp. 6277–6286. [Google Scholar]
  113. Javed, K.; White, M. Meta-learning representations for continual learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  114. Beaulieu, S.; Frati, L.; Miconi, T.; Lehman, J.; Stanley, K.O.; Clune, J.; Cheney, N. Learning to continually learn. In ECAI 2020; IOS Press, 2020; pp. 992–1001. [Google Scholar]
  115. Lee, E.; Huang, C.H.; Lee, C.Y. Few-shot and continual learning with attentive independent mechanisms. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision; 2021; pp. 9455–9464. [Google Scholar]
  116. Rajasegaran, J.; Khan, S.; Hayat, M.; Khan, F.S.; Shah, M. itaml: An incremental task-agnostic meta-learning approach. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020; pp. 13588–13597. [Google Scholar]
  117. Gupta, G.; Yadav, K.; Paull, L. Look-ahead meta learning for continual learning. Adv. Neural Inf. Process. Syst. 2020, 33, 11588–11598. [Google Scholar]
  118. Caccia, L.; Belilovsky, E.; Caccia, M.; Pineau, J. Online learned continual compression with adaptive quantization modules. In Proceedings of the International conference on machine learning. PMLR; 2020; pp. 1240–1250. [Google Scholar]
  119. KJ, J.; N Balasubramanian, V. Meta-consolidation for continual learning. Adv. Neural Inf. Process. Syst. 2020, 33, 14374–14386. [Google Scholar]
  120. Henning, C.; Cervera, M.; D’Angelo, F.; Von Oswald, J.; Traber, R.; Ehret, B.; Kobayashi, S.; Grewe, B.F.; Sacramento, J. Posterior meta-replay for continual learning. Adv. Neural Inf. Process. Syst. 2021, 34, 14135–14149. [Google Scholar]
  121. Hurtado, J.; Raymond, A.; Soto, A. Optimizing reusable knowledge for continual learning via metalearning. Adv. Neural Inf. Process. Syst. 2021, 34, 14150–14162. [Google Scholar]
  122. Wang, R.; Bao, Y.; Zhang, B.; Liu, J.; Zhu, W.; Guo, G. Anti-retroactive interference for lifelong learning. In Proceedings of the European Conference on Computer Vision; 2022; Springer; pp. 163–178. [Google Scholar]
  123. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial intelligence and statistics. PMLR; 2017; pp. 1273–1282. [Google Scholar]
  124. Yoon, J.; Jeong, W.; Lee, G.; Yang, E.; Hwang, S.J. Federated continual learning with weighted inter-client transfer. In Proceedings of the International Conference on Machine Learning. PMLR; 2021; pp. 12073–12086. [Google Scholar]
  125. Usmanova, A.; Portet, F.; Lalanda, P.; Vega, G. A distillation-based approach integrating continual learning and federated learning for pervasive services. arXiv 2021, arXiv:2109.04197. [Google Scholar] [CrossRef]
  126. Park, T.J.; Kumatani, K.; Dimitriadis, D. Tackling dynamics in federated incremental learning with variational embedding rehearsal. arXiv 2021, arXiv:2110.09695. [Google Scholar] [CrossRef]
  127. Mermillod, M.; Bugaiska, A.; Bonin, P. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. 2013. [Google Scholar] [CrossRef]
  128. Grossberg, S. Adaptive Resonance Theory: How a brain learns to consciously attend, learn, and recognize a changing world. Neural Netw. 2013, 37, 1–47. [Google Scholar] [CrossRef]
  129. Abraham, W.C.; Robins, A. Memory retention–the synaptic stability versus plasticity dilemma. Trends Neurosci. 2005, 28, 73–78. [Google Scholar] [CrossRef]
  130. Hebb, D.O. The organization of behavior: A neuropsychological theory; Psychology press, 2005. [Google Scholar]
  131. Power, J.D.; Schlaggar, B.L. Neural plasticity across the lifespan. Wiley Interdiscip. Rev. Dev. Biol. 2017, 6, e216. [Google Scholar] [CrossRef]
  132. McCloskey, M.; Cohen, N.J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation; Elsevier, 1989; Vol. 24, pp. 109–165. [Google Scholar]
  133. Ratcliff, R. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol. Rev. 1990, 97, 285. [Google Scholar] [CrossRef]
  134. Zhao, J.; Zhang, X.; Zhao, B.; Hu, W.; Diao, T.; Wang, L.; Zhong, Y.; Li, Q. Genetic dissection of mutual interference between two consecutive learning tasks in Drosophila. Elife 2023, 12, e83516. [Google Scholar] [CrossRef]
  135. Hayashi-Takagi, A.; Yagishita, S.; Nakamura, M.; Shirai, F.; Wu, Y.I.; Loshbaugh, A.L.; Kuhlman, B.; Hahn, K.M.; Kasai, H. Labelling and optical erasure of synaptic memory traces in the motor cortex. Nature 2015, 525, 333–338. [Google Scholar] [CrossRef] [PubMed]
  136. Yang, G.; Pan, F.; Gan, W.B. Stably maintained dendritic spines are associated with lifelong memories. Nature 2009, 462, 920–924. [Google Scholar] [CrossRef] [PubMed]
  137. Zhang, X.; Li, Q.; Wang, L.; Liu, Z.J.; Zhong, Y. Active protection: learning-activated Raf/MAPK activity protects labile memory from Rac1-independent forgetting. Neuron 2018, 98, 142–155. [Google Scholar] [CrossRef]
  138. Huszár, F. Note on the quadratic penalties in elastic weight consolidation. Proc. Natl. Acad. Sci. 2018, 115, E2496–E2497. [Google Scholar] [CrossRef] [PubMed]
  139. McNaughton, B.L.; O’Reilly, R.C. Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of. Psychol. Rev. 1995, 102, 419–457. [Google Scholar] [CrossRef]
  140. Graves, L.; Nagisetty, V.; Ganesh, V. Does AI remember? neural networks and the right to be forgotten. In Neural Networks and the Right to be Forgotten; 2020. [Google Scholar]
  141. Ding, M.; Ji, K.; Wang, D.; Xu, J. Understanding forgetting in continual learning with linear regression. arXiv 2024, arXiv:2405.17583. [Google Scholar] [CrossRef]
  142. Aljundi, R.; Lin, M.; Goujaud, B.; Bengio, Y. Gradient based sample selection for online continual learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  143. Ritter, H.; Botev, A.; Barber, D. Online structured laplace approximations for overcoming catastrophic forgetting. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  144. Schwarz, J.; Czarnecki, W.; Luketina, J.; Grabska-Barwinska, A.; Teh, Y.W.; Pascanu, R.; Hadsell, R. Progress & compress: A scalable framework for continual learning. In Proceedings of the International conference on machine learning. PMLR; 2018; pp. 4528–4537. [Google Scholar]
  145. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  146. Dhar, P.; Singh, R.V.; Peng, K.C.; Wu, Z.; Chellappa, R. Learning without memorizing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019; pp. 5138–5146. [Google Scholar]
  147. Iscen, A.; Zhang, J.; Lazebnik, S.; Schmid, C. Memory-efficient incremental learning through feature adaptation. In Proceedings of the European conference on computer vision; 2020; Springer; pp. 699–715. [Google Scholar]
  148. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947. [Google Scholar] [CrossRef]
  149. Castro, F.M.; Marín-Jiménez, M.J.; Guil, N.; Schmid, C.; Alahari, K. End-to-end incremental learning. In Proceedings of the European conference on computer vision (ECCV); 2018; pp. 233–248. [Google Scholar]
  150. Douillard, A.; Cord, M.; Ollion, C.; Robert, T.; Valle, E. Podnet: Pooled outputs distillation for small-tasks incremental learning. In Proceedings of the Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part XX 16. Springer, 2020, pp. 86–102.
  151. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019; pp. 831–839. [Google Scholar]
  152. Wu, C.; Herranz, L.; Liu, X.; Van De Weijer, J.; Raducanu, B.; et al. Memory replay gans: Learning to generate new categories without forgetting. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  153. Zhai, M.; Chen, L.; Tung, F.; He, J.; Nawhal, M.; Mori, G. Lifelong gan: Continual learning for conditional image generation. In Proceedings of the IEEE/CVF international conference on computer vision; 2019; pp. 2759–2768. [Google Scholar]
  154. Liu, X.; Masana, M.; Herranz, L.; Van de Weijer, J.; Lopez, A.M.; Bagdanov, A.D. Rotate your networks: Better weight consolidation and less catastrophic forgetting. In Proceedings of the 2018 24th international conference on pattern recognition (ICPR); IEEE, 2018; pp. 2262–2268. [Google Scholar]
  155. Benzing, F. Unifying importance based regularisation methods for continual learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR; 2022; pp. 2372–2396. [Google Scholar]
  156. Lee, S.W.; Kim, J.H.; Jun, J.; Ha, J.W.; Zhang, B.T. Overcoming catastrophic forgetting by incremental moment matching. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  157. Chaudhry, A.; Dokania, P.K.; Ajanthan, T.; Torr, P.H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European conference on computer vision (ECCV); 2018; pp. 532–547. [Google Scholar]
  158. Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV); 2018; pp. 139–154. [Google Scholar]
  159. Nguyen, C.V.; Li, Y.; Bui, T.D.; Turner, R.E. Variational continual learning. arXiv 2017, arXiv:1710.10628. [Google Scholar]
  160. Chaudhry, A.; Rohrbach, M.; Elhoseiny, M.; Ajanthan, T.; Dokania, P.K.; Torr, P.H.; Ranzato, M. On tiny episodic memories in continual learning. arXiv 2019, arXiv:1902.10486. [Google Scholar] [CrossRef]
  161. Vitter, J.S. Random sampling with a reservoir. ACM Trans. Math. Softw. (TOMS) 1985, 11, 37–57. [Google Scholar] [CrossRef]
  162. Borsos, Z.; Mutny, M.; Krause, A. Coresets via bilevel optimization for continual learning and streaming. Adv. Neural Inf. Process. Syst. 2020, 33, 14879–14890. [Google Scholar]
  163. Yoon, J.; Madaan, D.; Yang, E.; Hwang, S.J. Online coreset selection for rehearsal-based continual learning. arXiv 2021, arXiv:2106.01085. [Google Scholar]
  164. Shim, D.; Mai, Z.; Jeong, J.; Sanner, S.; Kim, H.; Jang, J. Online class-incremental continual learning with adversarial shapley value. Proc. AAAI Conf. Artif. Intell. 2021, 35, 9630–9638. [Google Scholar]
  165. Bang, J.; Kim, H.; Yoo, Y.; Ha, J.W.; Choi, J. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021; pp. 8218–8227. [Google Scholar]
  166. Tiwari, R.; Killamsetty, K.; Iyer, R.; Shenoy, P. Gcr: Gradient coreset based replay buffer selection for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 99–108. [Google Scholar]
  167. Van Den Oord, A.; Vinyals, O.; et al. Neural discrete representation learning. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  168. Wang, L.; Zhang, X.; Yang, K.; Yu, L.; Li, C.; Hong, L.; Zhang, S.; Li, Z.; Zhong, Y.; Zhu, J. Memory replay with data compression for continual learning. arXiv 2022, arXiv:2202.06592. [Google Scholar] [CrossRef]
  169. Kulesza, A.; Taskar, B.; et al. Determinantal point processes for machine learning. Found. Trends Mach. Learn. 2012, 5, 123–286. [Google Scholar]
  170. Kumari, L.; Wang, S.; Zhou, T.; Bilmes, J.A. Retrospective adversarial replay for continual learning. Adv. Neural Inf. Process. Syst. 2022, 35, 28530–28544. [Google Scholar]
  171. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  172. Belouadah, E.; Popescu, A. Il2m: Class incremental learning with dual memory. In Proceedings of the IEEE/CVF international conference on computer vision; 2019; pp. 583–592. [Google Scholar]
  173. Ebrahimi, S.; Petryk, S.; Gokul, A.; Gan, W.; Gonzalez, J.E.; Rohrbach, M.; Darrell, T. Remembering for the right reasons: Explanations reduce catastrophic forgetting. Appl. AI Lett. 2021, 2, e44. [Google Scholar] [CrossRef]
  174. Liu, Y.; Su, Y.; Liu, A.A.; Schiele, B.; Sun, Q. Mnemonics training: Multi-class incremental learning without forgetting. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition; 2020; pp. 12245–12254. [Google Scholar]
  175. Jin, X.; Sadhu, A.; Du, J.; Ren, X. Gradient-based editing of memory examples for online task-free continual learning. Adv. Neural Inf. Process. Syst. 2021, 34, 29193–29205. [Google Scholar]
  176. Chaudhry, A.; Ranzato, M.; Rohrbach, M.; Elhoseiny, M. Efficient lifelong learning with a-gem. arXiv 2018, arXiv:1812.00420. [Google Scholar]
  177. Tang, S.; Chen, D.; Zhu, J.; Yu, S.; Ouyang, W. Layerwise optimization by gradient decomposition for continual learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition; 2021; pp. 9634–9643. [Google Scholar]
  178. Sun, Q.; Lyu, F.; Shang, F.; Feng, W.; Wan, L. Exploring example influence in continual learning. Adv. Neural Inf. Process. Syst. 2022, 35, 27075–27086. [Google Scholar]
  179. Aljundi, R.; Belilovsky, E.; Tuytelaars, T.; Charlin, L.; Caccia, M.; Lin, M.; Page-Caccia, L. Online continual learning with maximal interfered retrieval. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  180. Chaudhry, A.; Gordo, A.; Dokania, P.; Torr, P.; Lopez-Paz, D. Using hindsight to anchor past knowledge in continual learning. Proc. AAAI Conf. Artif. Intell. 2021, 35, 6993–7001. [Google Scholar] [CrossRef]
  181. Wu, Y.; Chen, Y.; Wang, L.; Ye, Y.; Liu, Z.; Guo, Y.; Fu, Y. Large scale incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019; pp. 374–382. [Google Scholar]
  182. Zhao, B.; Xiao, X.; Gan, G.; Zhang, B.; Xia, S.T. Maintaining discrimination and fairness in class incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020; pp. 13208–13217. [Google Scholar]
  183. Ahn, H.; Kwak, J.; Lim, S.; Bang, H.; Kim, H.; Moon, T. Ss-il: Separated softmax for incremental learning. In Proceedings of the IEEE/CVF International conference on computer vision; 2021; pp. 844–853. [Google Scholar]
  184. Cha, H.; Lee, J.; Shin, J. Co2l: Contrastive continual learning. In Proceedings of the IEEE/CVF International conference on computer vision; 2021; pp. 9516–9525. [Google Scholar]
  185. Simon, C.; Koniusz, P.; Harandi, M. On learning the geodesic path for incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition; 2021; pp. 1591–1600. [Google Scholar]
  186. Joseph, K.; Khan, S.; Khan, F.S.; Anwer, R.M.; Balasubramanian, V.N. Energy-based latent aligner for incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022; pp. 7452–7461. [Google Scholar]
  187. Kurmi, V.K.; Patro, B.N.; Subramanian, V.K.; Namboodiri, V.P. Do not forget to attend to uncertainty while mitigating catastrophic forgetting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; 2021; pp. 736–745. [Google Scholar]
  188. Ashok, A.; Joseph, K.; Balasubramanian, V.N. Class-incremental learning with cross-space clustering and controlled transfer. In Proceedings of the European conference on computer vision; 2022; Springer; pp. 105–122. [Google Scholar]
  189. Hu, X.; Tang, K.; Miao, C.; Hua, X.S.; Zhang, H. Distilling causal effect of data in class-incremental learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition; 2021; pp. 3957–3966. [Google Scholar]
  190. Bhat, P.; Zonooz, B.; Arani, E. Task-aware information routing from common representation space in lifelong learning. arXiv 2023, arXiv:2302.11346. [Google Scholar] [CrossRef]
  191. Hou, S.; Pan, X.; Loy, C.C.; Wang, Z.; Lin, D. Lifelong learning via progressive distillation and retrospection. In Proceedings of the European Conference on Computer Vision (ECCV); 2018; pp. 437–452. [Google Scholar]
  192. Wang, F.Y.; Zhou, D.W.; Ye, H.J.; Zhan, D.C. Foster: Feature boosting and compression for class-incremental learning. In Proceedings of the European conference on computer vision; 2022; Springer; pp. 398–414. [Google Scholar]
  193. Wang, L.; Zhang, M.; Jia, Z.; Li, Q.; Bao, C.; Ma, K.; Zhu, J.; Zhong, Y. Afec: Active forgetting of negative transfer in continual learning. Adv. Neural Inf. Process. Syst. 2021, 34, 22379–22391. [Google Scholar]
  194. Verwimp, E.; De Lange, M.; Tuytelaars, T. Rehearsal revealed: The limits and merits of revisiting samples in continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021; pp. 9385–9394. [Google Scholar]
  195. Bonicelli, L.; Boschini, M.; Porrello, A.; Spampinato, C.; Calderara, S. On the effectiveness of lipschitz-driven rehearsal in continual learning. Adv. Neural Inf. Process. Syst. 2022, 35, 31886–31901. [Google Scholar]
  196. Yu, L.; Hu, T.; Hong, L.; Liu, Z.; Weller, A.; Liu, W. Continual learning by modeling intra-class variation. arXiv 2022, arXiv:2210.05398. [Google Scholar]
  197. Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; Calderara, S. Dark experience for general continual learning: a strong, simple baseline. Adv. Neural Inf. Process. Syst. 2020, 33, 15920–15930. [Google Scholar]
  198. Boschini, M.; Bonicelli, L.; Buzzega, P.; Porrello, A.; Calderara, S. Class-incremental continual learning into the extended der-verse. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5497–5512. [Google Scholar] [CrossRef]
  199. Prabhu, A.; Torr, P.H.; Dokania, P.K. Gdumb: A simple approach that questions our progress in continual learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 2020, pp. 524–540.
  200. Ayub, A.; Wagner, A.R. EEC: Learning to encode and regenerate images for continual learning. arXiv 2021, arXiv:2101.04904. [Google Scholar] [CrossRef]
  201. Ostapenko, O.; Puscas, M.; Klein, T.; Jahnichen, P.; Nabi, M. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019; pp. 11321–11329. [Google Scholar]
  202. Kemker, R.; Kanan, C. Fearnet: Brain-inspired model for incremental learning. arXiv 2017, arXiv:1711.10563. [Google Scholar]
  203. Riemer, M.; Klinger, T.; Bouneffouf, D.; Franceschini, M. Scalable recollections for continual lifelong learning. Proc. AAAI Conf. Artif. Intell. 2019, 33, 1352–1359. [Google Scholar] [CrossRef]
  204. Rostami, M.; Kolouri, S.; Pilly, P.K. Complementary learning for overcoming catastrophic forgetting using experience replay. arXiv 2019, arXiv:1903.04566. [Google Scholar] [CrossRef]
  205. Pfülb, B.; Gepperth, A.; Bagus, B. Continual learning with fully probabilistic models. arXiv 2021, arXiv:2104.09240. [Google Scholar] [CrossRef]
  206. Gopalakrishnan, S.; Singh, P.R.; Fayek, H.; Ramasamy, S.; Ambikapathi, A. Knowledge capture and replay for continual learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision; 2022; pp. 10–18. [Google Scholar]
  207. Ye, F.; Bors, A.G. Learning latent representations across multiple data domains using lifelong VAEGAN. In Proceedings of the European Conference on Computer Vision; 2020; Springer; pp. 777–795. [Google Scholar]
  208. Seff, A.; Beatson, A.; Suo, D.; Liu, H. Continual learning in generative adversarial nets. arXiv 2017, arXiv:1705.08395. [Google Scholar] [CrossRef]
  209. He, C.; Wang, R.; Shan, S.; Chen, X. Exemplar-supported generative reproduction for class incremental learning. Proc. BMVC 2018, Vol. 1, 2. [Google Scholar]
  210. Xiang, Y.; Fu, Y.; Ji, P.; Huang, H. Incremental learning using conditional adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019; pp. 6619–6628. [Google Scholar]
  211. Cong, Y.; Zhao, M.; Li, J.; Wang, S.; Carin, L. Gan memory with no forgetting. Adv. Neural Inf. Process. Syst. 2020, 33, 16481–16494. [Google Scholar]
  212. Liu, X.; Wu, C.; Menta, M.; Herranz, L.; Raducanu, B.; Bagdanov, A.D.; Jui, S.; Van de Weijer, J. Generative feature replay for class-incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops; 2020; pp. 226–227. [Google Scholar]
  213. Ostapenko, O.; Lesort, T.; Rodriguez, P.; Arefin, M.R.; Douillard, A.; Rish, I.; Charlin, L. Continual learning with foundation models: An empirical study of latent replay. In Proceedings of the Conference on lifelong learning agents. PMLR; 2022; pp. 60–91. [Google Scholar]
  214. Wang, Z.; Liu, L.; Duan, Y.; Tao, D. Continual learning through retrieval and imagination. Proc. AAAI Conf. Artif. Intell. 2022, 36, 8594–8602. [Google Scholar] [CrossRef]
  215. Wang, Z.; Liu, L.; Kong, Y.; Guo, J.; Tao, D. Online continual learning with contrastive vision transformer. In Proceedings of the European Conference on Computer Vision; 2022; Springer; pp. 631–650. [Google Scholar]
  216. Wang, Y.; Huang, Z.; Hong, X. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. Adv. Neural Inf. Process. Syst. 2022, 35, 5682–5695. [Google Scholar]
  217. Wang, Z.; Zhang, Z.; Ebrahimi, S.; Sun, R.; Zhang, H.; Lee, C.Y.; Ren, X.; Su, G.; Perot, V.; Dy, J.; et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In Proceedings of the European conference on computer vision; 2022; Springer; pp. 631–648. [Google Scholar]
  218. Park, C.W.; Seo, S.W.; Kang, N.; Ko, B.; Choi, B.W.; Park, C.M.; Chang, D.K.; Kim, H.; Kim, H.; Lee, H.; et al. Artificial intelligence in health care: current applications and issues. J. Korean Med. Sci. 2020, 35. [Google Scholar] [CrossRef] [PubMed]
  219. Zhu, D.; Bu, Q.; Zhu, Z.; Zhang, Y.; Wang, Z. Advancing autonomy through lifelong learning: a survey of autonomous intelligent systems. Front. Neurorobotics 2024, 18, 1385778. [Google Scholar] [CrossRef]
  220. Ciupek, D.; Malawski, M.; Pieciak, T. Federated Learning: A new frontier in the exploration of multi-institutional medical imaging data. arXiv 2025, arXiv:2503.20107. [Google Scholar]
  221. Thakur, G.K.; Thakur, A.; Kulkarni, S.; Khan, N.; Khan, S. Deep learning approaches for medical image analysis and diagnosis. Cureus 2024, 16. [Google Scholar] [CrossRef]
  222. Jeon, J.; Kim, J.; Kim, J.; Kim, K.; Mohaisen, A.; Kim, J.K. Privacy-preserving deep learning computation for geo-distributed medical big-data platforms. In Proceedings of the 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks–Supplemental Volume (DSN-S); IEEE, 2019; pp. 3–4. [Google Scholar]
  223. Pianykh, O.S.; Langs, G.; Dewey, M.; Enzmann, D.R.; Herold, C.J.; Schoenberg, S.O.; Brink, J.A. Continuous learning AI in radiology: implementation principles and early applications. Radiology 2020, 297, 6–14. [Google Scholar] [CrossRef] [PubMed]
  224. Pinto-Coelho, L. How artificial intelligence is shaping medical imaging technology: a survey of innovations and applications. Bioengineering 2023, 10, 1435. [Google Scholar] [CrossRef]
  225. Zhu, Z.; Sun, Y.; Honarvar Shakibaei Asli, B. Early Breast Cancer Detection Using Artificial Intelligence Techniques Based on Advanced Image Processing Tools. Electronics 2024, 13, 3575. [Google Scholar] [CrossRef]
  226. da Silva Motta, D.; Badaró, R.; Santos, A.; Kirchner, F. Use of artificial intelligence on the control of vector-borne diseases; IntechOpen, 2018. [Google Scholar]
  227. Dasari, S.; Ebert, F.; Tian, S.; Nair, S.; Bucher, B.; Schmeckpeper, K.; Singh, S.; Levine, S.; Finn, C. Robonet: Large-scale multi-robot learning. arXiv 2019, arXiv:1910.11215. [Google Scholar]
  228. Haque, N. Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks. arXiv 2025, arXiv:2504.01241. [Google Scholar] [CrossRef]
  229. Yao, Y.; González-Vélez, H. AI-Powered System to Facilitate Personalized Adaptive Learning in Digital Transformation. Appl. Sci. 2025, 15, 4989. [Google Scholar] [CrossRef]
  230. Li, D.; Chen, Z.; Cho, E.; Hao, J.; Liu, X.; Xing, F.; Guo, C.; Liu, Y. Overcoming catastrophic forgetting during domain adaptation of seq2seq language generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2022; pp. 5441–5454. [Google Scholar]
  231. Liu, T.; Ungar, L.; Sedoc, J. Continual learning for sentence representations using conceptors. arXiv 2019, arXiv:1904.09187. [Google Scholar] [CrossRef]
  232. Monaikul, N.; Castellucci, G.; Filice, S.; Rokhlenko, O. Continual learning for named entity recognition. Proc. AAAI Conf. Artif. Intell. 2021, 35, 13570–13577. [Google Scholar] [CrossRef]
  233. Li, G.; Zhai, Y.; Chen, Q.; Gao, X.; Zhang, J.; Zhang, Y. Continual few-shot intent detection. In Proceedings of the 29th international conference on computational linguistics; 2022; pp. 333–343. [Google Scholar]
  234. Liu, Q.; Yu, X.; He, S.; Liu, K.; Zhao, J. Lifelong intent detection via multi-strategy rebalancing. arXiv 2021, arXiv:2108.04445. [Google Scholar] [CrossRef]
  235. Varshney, V.; Patidar, M.; Kumar, R.; Shroff, G.; Vig, L. Prompt augmented generative replay via supervised contrastive training for lifelong intent detection, 2024. US Patent App. 18/215,972.
  236. Qin, C.; Joty, S. Lfpt5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. arXiv 2021, arXiv:2110.07298. [Google Scholar]
  237. Sun, J.; Wang, S.; Zhang, J.; Zong, C. Distill and replay for continual language learning. In Proceedings of the 28th international conference on computational linguistics; 2020; pp. 3569–3579. [Google Scholar]
  238. Cao, Y.; Wei, H.R.; Chen, B.; Wan, X. Continual learning for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2021; pp. 3964–3974. [Google Scholar]
  239. Shao, C.; Feng, Y. Overcoming catastrophic forgetting beyond continual learning: Balanced training for neural machine translation. arXiv 2022, arXiv:2203.03910. [Google Scholar] [CrossRef]
  240. Qin, Y.; Zhang, J.; Lin, Y.; Liu, Z.; Li, P.; Sun, M.; Zhou, J. Elle: Efficient lifelong pre-training for emerging data. arXiv 2022, arXiv:2203.06311. [Google Scholar] [CrossRef]
  241. Huang, Y.; Zhang, Y.; Chen, J.; Wang, X.; Yang, D. Continual learning for text classification with information disentanglement based regularization. arXiv 2021, arXiv:2104.05489. [Google Scholar] [CrossRef]
  242. de Masson D’Autume, C.; Ruder, S.; Kong, L.; Yogatama, D. Episodic memory in lifelong language learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  243. Wang, Z.; Mehta, S.V.; Póczos, B.; Carbonell, J. Efficient meta lifelong-learning with limited memory. arXiv 2020, arXiv:2010.02500. [Google Scholar] [CrossRef]
  244. Xu, K.; Verma, S.; Finn, C.; Levine, S. Continual learning of control primitives: Skill discovery via reset-games. Adv. Neural Inf. Process. Syst. 2020, 33, 4999–5010. [Google Scholar]
  245. Mi, F.; Chen, L.; Zhao, M.; Huang, M.; Faltings, B. Continual learning for natural language generation in task-oriented dialog systems. arXiv 2020, arXiv:2010.00910. [Google Scholar] [CrossRef]
  246. Li, Z.; Qu, L.; Haffari, G. Total recall: a customized continual learning method for neural semantic parsers. arXiv 2021, arXiv:2109.05186. [Google Scholar] [CrossRef]
  247. Sun, F.K.; Ho, C.H.; Lee, H.Y. Lamol: Language modeling for lifelong language learning. arXiv 2019, arXiv:1909.03329. [Google Scholar] [CrossRef]
  248. Zhang, Y.; Wang, X.; Yang, D. Continual sequence generation with adaptive compositional modules. arXiv 2022, arXiv:2203.10652. [Google Scholar] [CrossRef]
  249. Wang, R.; Yu, T.; Zhao, H.; Kim, S.; Mitra, S.; Zhang, R.; Henao, R. Few-shot class-incremental learning for named entity recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics; 2022; Volume 1, pp. 571–582. [Google Scholar]
  250. Geng, B.; Yuan, F.; Xu, Q.; Shen, Y.; Xu, R.; Yang, M. Continual learning for task-oriented dialogue system with iterative network pruning, expanding and masking. arXiv 2021, arXiv:2107.08173. [Google Scholar] [CrossRef]
  251. Shen, Y.; Zeng, X.; Jin, H. A progressive model to enable continual learning for semantic slot filling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019; pp. 1279–1284. [Google Scholar]
  252. Wang, C.; Pan, H.; Liu, Y.; Chen, K.; Qiu, M.; Zhou, W.; Huang, J.; Chen, H.; Lin, W.; Cai, D. Mell: Large-scale extensible user intent classification for dialogue systems with meta lifelong learning. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining; 2021; pp. 3649–3659. [Google Scholar]
  253. Wu, T.; Li, X.; Li, Y.F.; Haffari, G.; Qi, G.; Zhu, Y.; Xu, G. Curriculum-meta learning for order-robust continual relation extraction. Proc. AAAI Conf. Artif. Intell. 2021, 35, 10363–10369. [Google Scholar] [CrossRef]
  254. Madotto, A.; Lin, Z.; Zhou, Z.; Moon, S.; Crook, P.; Liu, B.; Yu, Z.; Cho, E.; Wang, Z. Continual learning in task-oriented dialogue systems. arXiv 2020, arXiv:2012.15504. [Google Scholar] [CrossRef]
  255. Ermis, B.; Zappella, G.; Wistuba, M.; Rawal, A.; Archambeau, C. Memory efficient continual learning with transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 10629–10642. [Google Scholar]
  256. Zhu, Q.; Li, B.; Mi, F.; Zhu, X.; Huang, M. Continual prompt tuning for dialog state tracking. arXiv 2022, arXiv:2203.06654. [Google Scholar] [CrossRef]
  257. Liu, M.; Chang, S.; Huang, L. Incremental prompting: Episodic memory prompt for lifelong event detection. arXiv 2022, arXiv:2204.07275. [Google Scholar] [CrossRef]
  258. Yin, W.; Li, J.; Xiong, C. Contintin: Continual learning from task instructions. arXiv 2022, arXiv:2203.08512. [Google Scholar]
  259. Xia, C.; Yin, W.; Feng, Y.; Yu, P. Incremental few-shot text classification with multi-round new classes: Formulation, dataset and system. arXiv 2021, arXiv:2104.11882. [Google Scholar]
  260. Wang, L.; Xie, J.; Zhang, X.; Huang, M.; Su, H.; Zhu, J. Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality. Adv. Neural Inf. Process. Syst. 2023, 36, 69054–69076. [Google Scholar]
  261. Wang, Z.; Zhang, Z.; Lee, C.Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; Pfister, T. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2022; pp. 139–149. [Google Scholar]
  262. Geishauser, C.; van Niekerk, C.; Lubis, N.; Heck, M.; Lin, H.C.; Feng, S.; Gašić, M. Dynamic dialogue policy for continual reinforcement learning. arXiv 2022, arXiv:2204.05928. [Google Scholar] [CrossRef]
  263. Wang, W.; Zhang, J.; Li, Q.; Hwang, M.Y.; Zong, C.; Li, Z. Incremental learning from scratch for task-oriented dialogue systems. arXiv 2019, arXiv:1906.04991. [Google Scholar] [CrossRef]
  264. Pasunuru, R.; Stoyanov, V.; Bansal, M. Continual few-shot learning for text classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; 2021; pp. 5688–5702. [Google Scholar]
  265. Qin, C.; Joty, S. Continual few-shot relation learning via embedding space regularization and data augmentation. arXiv 2022, arXiv:2203.02135. [Google Scholar] [CrossRef]
  266. Ren, H.; Cai, Y.; Chen, X.; Wang, G.; Li, Q.; et al. A two-phase prototypical network model for incremental few-shot relation classification; Association for Computational Linguistics (ACL), 2020. [Google Scholar]
  267. Garcia, X.; Constant, N.; Parikh, A.P.; Firat, O. Towards continual learning for multilingual machine translation via vocabulary substitution. arXiv 2021, arXiv:2103.06799. [Google Scholar] [CrossRef]
  268. Gu, S.; Feng, Y. Investigating catastrophic forgetting during continual training for neural machine translation. arXiv 2020, arXiv:2011.00678. [Google Scholar] [CrossRef]
  269. Yan, S.; Hong, L.; Xu, H.; Han, J.; Tuytelaars, T.; Li, Z.; He, X. Generative negative text replay for continual vision-language pretraining. In Proceedings of the European Conference on Computer Vision; 2022; Springer; pp. 22–38. [Google Scholar]
  270. Greco, C.; Plank, B.; Fernández, R.; Bernardi, R. Psycholinguistics meets continual learning: Measuring catastrophic forgetting in visual question answering. arXiv 2019, arXiv:1906.04229. [Google Scholar] [CrossRef]
  271. Srinivasan, T.; Chang, T.Y.; Pinto Alva, L.; Chochlakis, G.; Rostami, M.; Thomason, J. Climb: A continual learning benchmark for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2022, 35, 29440–29453. [Google Scholar]
  272. Martínez-Plumed, F.; Ferri, C.; Hernández-Orallo, J.; Ramírez-Quintana, M.J. Forgetting and consolidation for incremental and cumulative knowledge acquisition systems. arXiv 2015, arXiv:1502.05615. [Google Scholar] [CrossRef]
  273. Christakopoulou, K.; Lalama, A.; Adams, C.; Qu, I.; Amir, Y.; Chucri, S.; Vollucci, P.; Soldo, F.; Bseiso, D.; Scodel, S.; et al. Large language models for user interest journeys. arXiv 2023, arXiv:2305.15498. [Google Scholar] [CrossRef]
  274. Wang, X.J.; Lee, C.P.; Mutlu, B. LearnMate: Enhancing Online Education with LLM-Powered Personalized Learning Plans and Support. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems; 2025; pp. 1–10. [Google Scholar]
  275. Sabeima, M.; Lamolle, M.; Nanne, M.F. Towards personalized adaptive learning in e-learning recommender systems. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 14–20. [Google Scholar] [CrossRef]
  276. Joy, J.; Raj, N.S.; VG, R. Ontology-based E-learning content recommender system for addressing the pure cold-start problem. ACM J. Data Inf. Qual. 2021, 13, 1–27. [Google Scholar] [CrossRef]
  277. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  278. Bountouni, N.; Koussouris, S.; Vasileiou, A.; Kazazis, S.A. A Holistic Framework for Safeguarding of SMEs: A Case Study. In Proceedings of the 2023 19th International Conference on the Design of Reliable Communication Networks (DRCN); IEEE, 2023; pp. 1–5. [Google Scholar]
  279. Asmar, M.; Tuqan, A. Integrating machine learning for sustaining cybersecurity in digital banks. Heliyon 2024, 10. [Google Scholar] [CrossRef]
  280. Ahmed, U.; Nazir, M.; Sarwar, A.; Ali, T.; Aggoune, E.H.M.; Shahzad, T.; Khan, M.A. Signature-based intrusion detection using machine learning and deep learning approaches empowered with fuzzy clustering. Sci. Rep. 2025, 15, 1726. [Google Scholar]
  281. Dohare, S.; Hernandez-Garcia, J.F.; Lan, Q.; Rahman, P.; Mahmood, A.R.; Sutton, R.S. Loss of plasticity in deep continual learning. Nature 2024, 632, 768–774. [Google Scholar] [CrossRef]
  282. Mohammed, K. Harnessing the Speed and Accuracy of Machine Learning to Advance Cybersecurity. arXiv 2023, arXiv:2302.12415. [Google Scholar]
  283. Rahul-Vigneswaran, K.; Poornachandran, P.; Soman, K. A compendium on network and host based intrusion detection systems. In Proceedings of the ICDSMLA 2019: Proceedings of the 1st International Conference on Data Science, Machine Learning and Applications; 2020; Springer; pp. 23–30. [Google Scholar]
  284. Stokes, J.W.; Wang, D.; Marinescu, M.; Marino, M.; Bussone, B. Attack and defense of dynamic analysis-based, adversarial neural malware classification models. arXiv 2017, arXiv:1712.05919. [Google Scholar] [CrossRef]
  285. Sameen, M.; Han, K.; Hwang, S.O. PhishHaven—An efficient real-time AI phishing URLs detection system. IEEE Access 2020, 8, 83425–83443. [Google Scholar] [CrossRef]
Figure 1. A conceptual overview of CL: A. CL involves adapting to a sequence of tasks where data distributions evolve over time (detailed in Section 2.1). B. An effective approach must balance stability (red arrow) and plasticity (blue arrow), while also maintaining generalization across both within-task (green arrow) and across-task (olive green arrow) distribution shifts (detailed in Section 4). C. To meet these goals, various methods have been developed, each focusing on different facets of the ML pipeline (detailed in Section 6). D. CL is also applied in real-world scenarios, addressing issues such as increasing task complexity and the need for task-aware solutions (detailed in Section 10). This figure is adapted from [13].
Figure 2. Classification of CL approaches according to the characteristics of their training and inference configurations.
Table 1. Comparison of representative CL survey papers.

| Survey | Year | Main Focus | CL Coverage | Modern Trends | Main Limitation |
| --- | --- | --- | --- | --- | --- |
| Wang et al. [13] | 2024 | General CL theory and methods | TIL, DIL, and CIL | Limited discussion of prompting and PEFT | Minimal focus on foundation-model adaptation and modern multimodal CL |
| Van de Ven et al. [21] | 2022 | Taxonomy of CL scenarios | TIL, DIL, and CIL | Does not cover recent CL trends | Primarily focused on conceptual categorization of CL settings |
| Bidaki et al. [22] | 2025 | Online CL | Streaming and online CL | Benchmark-oriented discussion | Narrow scope centered on online learning settings |
| Zhou et al. [23] | 2024 | Class-incremental learning | Mainly CIL | Limited multimodal and foundation-model discussion | Restricted primarily to CIL strategies and benchmarks |
| Wickramasinghe et al. [24] | 2023 | Overview of CL methods | General CL settings | Covers traditional CL methods | Limited synthesis of transformer- and prompt-based CL methods |
| This review | 2026 | Modern CL trends, evaluation gaps, and deployment challenges | TIL, DIL, CIL, online, multimodal, and federated CL | Prompt learning, PEFT, foundation models, and diffusion models | Discusses evaluation inconsistency, benchmark fragmentation, deployment challenges, and emerging large-scale CL directions |
Table 2. Summary of differences between CL and related learning paradigms.

| Feature | CL | Transfer Learning | MTL | Online Learning |
| --- | --- | --- | --- | --- |
| Task Availability | Sequential | One-time transfer | Simultaneous | Single task |
| Focus | Learning without forgetting | Knowledge transfer | Shared representation | Incremental updates |
| Addresses Forgetting | Yes | No | No | No |
| Data Distribution | Non-stationary | Varies | Varies | Stationary |
Table 3. CL scenarios.

| Scenario | Task Label Known | Output Space | Data Distribution | Example Application |
| --- | --- | --- | --- | --- |
| Task-Incremental | Yes | Varies | Changes | Multi-task NLP, robotics |
| Domain-Incremental | No | Same | Changes | Handwriting recognition, IoT sensors |
| Class-Incremental | No | Expands | Changes | Image classification, object detection |
| Instance-Incremental | N/A | Same | Same (new data) | Spam filtering, online analytics |
| Unsupervised/Other | N/A | N/A | Changes | Clustering, RL in dynamic settings |
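
The practical difference between the task-incremental and class-incremental rows is what the model may assume at inference time. The following is a minimal sketch, not a method from any cited work: the `predict` helper, the `task_classes` mapping, and the logit values are all hypothetical and chosen only for illustration. When a task label is supplied, the prediction is restricted to that task's classes; otherwise the model must choose among every class seen so far.

```python
from typing import Dict, List, Optional


def predict(logits: List[float],
            task_classes: Dict[int, List[int]],
            task_id: Optional[int] = None) -> int:
    """Return the index of the predicted class.

    task_id given   -> task-incremental: only that task's classes compete.
    task_id is None -> class-incremental: all learned classes compete.
    """
    if task_id is None:
        candidates = list(range(len(logits)))
    else:
        candidates = task_classes[task_id]
    return max(candidates, key=lambda c: logits[c])


# Hypothetical setup: task 0 introduced classes 0-1, task 1 introduced classes 2-3.
task_classes = {0: [0, 1], 1: [2, 3]}
logits = [0.2, 1.5, 2.1, 0.3]

til_prediction = predict(logits, task_classes, task_id=0)  # restricted to classes 0 and 1
cil_prediction = predict(logits, task_classes)             # all four classes compete
```

With the same logits, the two settings can disagree: the task label removes cross-task confusion, which is one reason the class-incremental setting is generally regarded as harder.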
Table 4. Overview of TIL.

| Aspect | Details |
| --- | --- |
| Definition | Models learn a sequence of distinct tasks, with task identity provided during both training and inference. |
| Core Challenge | Maintaining task-specific performance without interference between tasks (catastrophic forgetting). |
| Inference Requirement | Task identity is known, allowing the model to use task-specific components (e.g., separate output heads). |
| Key Techniques | Task-specific output heads; parameter isolation (dedicated parameters for each task); regularization to preserve important parameters. |
| Advantages | Robust retention of task-specific knowledge; simplified learning due to known task boundaries and identities. |
| Challenges | Scalability issues with a growing number of tasks; limited knowledge transfer between tasks. |
| Example Applications | Sequential learning of different object categories (e.g., animals, vehicles); robotics: learning distinct tasks like grasping and navigation; diagnostic systems for different modalities (e.g., X-rays, MRIs). |
| Evaluation Metrics | Task-specific accuracy; memory and computational efficiency for handling multiple tasks. |
| Future Directions | Modular architectures to balance task isolation and scalability; approaches to enable knowledge transfer across tasks. |
Table 5. Overview of DIL.

| Aspect | Details |
| --- | --- |
| Definition | Models learn to adapt to new data distributions (domains) over time while maintaining the same task objective. |
| Core Challenge | Adapting to new domains without forgetting knowledge of previously learned domains (catastrophic forgetting). |
| Inference Requirement | Task identity is unknown; the model must generalize across domains without explicit domain information. |
| Key Techniques | Domain adaptation methods (e.g., feature alignment); regularization techniques to retain domain-invariant features; memory replay or dynamic models to balance old and new knowledge. |
| Advantages | Allows systems to handle non-stationary data distributions; maintains consistent task performance across multiple domains. |
| Challenges | Catastrophic forgetting when adapting to new domains; handling domain-specific biases while ensuring generalization; computational and memory constraints as new domains increase. |
| Example Applications | Object recognition in different environmental conditions (e.g., sunny, foggy, rainy); medical imaging systems adapting to scans from different hospitals or devices; NLP tasks such as sentiment analysis across different domains (e.g., movie reviews, product reviews). |
| Evaluation Metrics | Performance consistency across domains; forgetting rate for previously learned domains; domain generalization ability on unseen domains. |
| Future Directions | Efficient methods for domain adaptation without overfitting to new domains; scalable approaches to handle increasing numbers of domains; techniques to balance domain-specific and domain-invariant learning. |
Table 6. Overview of CIL.

| Aspect | Details |
| --- | --- |
| Definition | Models learn new classes sequentially, and the task identity is not provided during inference. |
| Core Challenge | Catastrophic forgetting: new learning overwrites knowledge of previously learned classes. |
| Inference Requirement | Model must classify inputs across all learned classes without knowledge of task identity. |
| Key Techniques | Memory replay (storing/replaying previous class examples); knowledge distillation (KD) (preserving learned representations); dynamic architecture (expanding capacity for new classes). |
| Advantages | Enables incremental learning without full retraining; efficient handling of scenarios where new class data is available over time. |
| Challenges | Handling class imbalance, as new classes often have fewer examples; managing memory and computational costs as the number of classes increases. |
| Example Applications | Extending image classifiers with new object categories; autonomous vehicles learning new traffic signs and objects; healthcare models adapting to diagnose new diseases. |
| Evaluation Metrics | Accuracy across all classes (old and new); forgetting rate (performance drop on previously learned classes). |
| Future Directions | Scalable memory-efficient replay methods; adaptive architectures that balance stability and plasticity; improved algorithms for mitigating class imbalance and preserving older class knowledge. |
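
The knowledge-distillation entry under Key Techniques can be illustrated with a small loss function. This is a sketch in the spirit of distillation-based CIL methods, not the formulation of any specific paper: the function names, the temperature value, and the example logits are assumptions. The loss is the cross-entropy between the temperature-softened outputs of the frozen old model and the outputs of the updated model over the old classes only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits: List[float],
                      new_logits: List[float],
                      temperature: float = 2.0) -> float:
    """Cross-entropy between old-model and new-model distributions.

    Only the first len(old_logits) outputs of the new model are compared,
    i.e. the outputs that correspond to previously learned classes.
    """
    num_old = len(old_logits)
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits[:num_old], temperature)
    return -sum(p * math.log(q) for p, q in zip(p_old, p_new))


# Hypothetical logits: two old classes, one newly added class.
old_model_out = [2.0, 0.5]
new_model_same = [2.0, 0.5, 1.0]     # old-class behaviour preserved
new_model_drift = [0.5, 2.0, 1.0]    # old-class behaviour changed

loss_same = distillation_loss(old_model_out, new_model_same)
loss_drift = distillation_loss(old_model_out, new_model_drift)
```

In training, a term like this would typically be added to the classification loss on the new classes, so that drifting away from the old model's predictions is penalized.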
Table 7. Overview of data-incremental learning.

| Aspect | Details |
| --- | --- |
| Definition | Models learn incrementally from a stream of data instances, which may belong to existing or new classes, without explicit task boundaries. |
| Core Challenge | Adapting to new data while retaining knowledge of previously learned data, especially without clear transitions or task identities. |
| Inference Requirement | The model must classify instances across all learned classes without explicit knowledge of when new data or classes were introduced. |
| Key Techniques | Memory replay (storing or generating past data); regularization techniques to preserve critical parameters; dynamic architectures for flexible capacity adjustment. |
| Advantages | Handles continuously evolving data streams; allows for learning without task-specific information or retraining. |
| Challenges | Managing catastrophic forgetting as new data arrives; handling class imbalance and unstructured data streams; resource efficiency for memory and computational costs. |
| Example Applications | Object recognition systems that adapt to new categories dynamically; recommendation systems updating preferences with new user data and items; continuous monitoring systems in healthcare, incorporating evolving signals from wearable devices. |
| Evaluation Metrics | Accuracy across all classes (old and new); forgetting rate (performance drop on previously learned data); adaptation speed to new data. |
| Future Directions | Hybrid methods combining memory replay with adaptive architectures; scalable solutions for handling large and imbalanced data streams; techniques for efficient data prioritization and representation learning. |
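
When there are no task boundaries, a replay memory cannot be filled per task. One common choice for such streams is reservoir sampling, which keeps a fixed-size buffer in which every example seen so far has the same probability of being retained. The sketch below is illustrative only: the `ReservoirBuffer` class and its parameters are hypothetical names, and integers stand in for real training examples.

```python
import random
from typing import Any, List


class ReservoirBuffer:
    """Fixed-capacity replay memory filled by reservoir sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0                      # number of stream examples observed
        self.rng = random.Random(seed)

    def add(self, example: Any) -> None:
        if len(self.items) < self.capacity:
            self.items.append(example)     # buffer not full yet: always keep
        else:
            # Pick a slot in [0, seen]; the new example is kept only if the
            # slot falls inside the buffer, i.e. with prob. capacity/(seen+1).
            slot = self.rng.randint(0, self.seen)
            if slot < self.capacity:
                self.items[slot] = example
        self.seen += 1

    def sample(self, batch_size: int) -> List[Any]:
        """Draw a replay mini-batch without replacement."""
        size = min(batch_size, len(self.items))
        return self.rng.sample(self.items, size)


# Hypothetical stream of 1000 examples with room for only 5 in memory.
buffer = ReservoirBuffer(capacity=5, seed=0)
for example in range(1000):
    buffer.add(example)
replay_batch = buffer.sample(3)
```

During training, each incoming mini-batch would be combined with `buffer.sample(...)` before the gradient step, so that older data keeps contributing to the loss.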
Table 8. Overview of other emerging paradigms in CL.

| Paradigm | Description | Key Challenges | Key Techniques | Example Applications |
| --- | --- | --- | --- | --- |
| Few-Shot CL | Models learn new tasks or classes with minimal labeled data while retaining prior knowledge. | Adapting with limited data; avoiding catastrophic forgetting. | Meta-learning; episodic memory; generative replay. | Rare disease diagnosis; few-shot object recognition. |
| Unsupervised CL | Models learn from data streams without explicit labels by discovering patterns or structures. | Extracting meaningful features from unlabeled data; balancing old and new pattern representations. | Self-supervised learning; contrastive learning; clustering methods. | Video surveillance anomaly detection; social media trend analysis. |
| Meta-Continual Learning | Combines meta-learning with CL to enable rapid adaptation to new tasks. | Balancing fast adaptation with knowledge retention; stability-plasticity tradeoff. | Gradient-based meta-learning; memory-augmented neural networks. | Personalized AI assistants; adaptive recommendation systems. |
| Federated CL | Models learn incrementally across distributed nodes while preserving privacy. | Handling heterogeneous data distributions across nodes; avoiding forgetting across distributed devices; privacy concerns. | Decentralized learning algorithms; secure aggregation protocols; adaptive synchronization methods. | Personalized healthcare monitoring; mobile device personalization. |
| Multi-Agent CL | Multiple agents learn and adapt in a shared environment while interacting and collaborating. | Coordinating knowledge transfer between agents; managing inter-agent dependencies and scalability. | Communication protocols; shared memory systems; ensemble learning. | Collaborative robotics; distributed sensor networks. |
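
For the federated row, the aggregation step at the server can be sketched with plain federated averaging. This is an assumption made for illustration: the table does not name a specific algorithm, and this sketch covers neither secure aggregation nor any forgetting-specific mechanism. Each client's parameter vector is weighted by its number of local samples; the function name and the numbers are hypothetical.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted mean of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 local samples respectively.
global_weights = federated_average(
    client_weights=[[1.0, 2.0], [3.0, 4.0]],
    client_sizes=[1, 3],
)
```

Because the larger client dominates the average, heterogeneous client data can pull the global model toward some distributions and away from others, which is the client-drift issue listed among the key challenges.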
Table 9. Overview of major theoretical foundations in CL.

| Concept | Core Idea | Main Challenge | Representative Strategies |
| --- | --- | --- | --- |
| Stability-Plasticity Dilemma | Balancing retention of prior knowledge with adaptation to new information | Excessive stability limits adaptation, while excessive plasticity causes forgetting | Regularization, replay mechanisms, adaptive architectures |
| Catastrophic Forgetting | Learning new tasks degrades performance on earlier tasks | Parameter interference and overlapping representations | Replay methods, parameter isolation, knowledge distillation, regularization |
| Forward and Backward Transfer | Leveraging previous knowledge to improve future learning and vice versa | Avoiding negative transfer across tasks | Shared representations, multi-task learning, transferable feature learning |
| Representation Learning | Learning reusable and task-invariant feature representations | Separating task-specific and generalizable features | Self-supervised learning, contrastive learning, feature disentanglement |
| Neuroscientific Inspiration | Drawing inspiration from biological memory and adaptation mechanisms | Translating biological principles into scalable AI systems | Synaptic consolidation, rehearsal mechanisms, dynamic expansion |
| Practical and Ethical Considerations | Ensuring reliable and responsible continual adaptation | Resource constraints, fairness, privacy, and safety | Lightweight models, federated learning, fairness-aware training |
Table 10. Overview of catastrophic forgetting.

| Aspect | Description |
| --- | --- |
| Definition | The significant loss of performance on previously learned tasks when a neural network learns new tasks. |
| Cause | Overwriting of neural network parameters due to global updates during training on new tasks. |
| Key Mechanism | Parameter drift: critical parameters for previous tasks are modified to optimize new task learning. |
| Factors Exacerbating Forgetting | Overlapping representations shared by different tasks; sequential data access without revisiting earlier tasks; lack of task awareness during inference in class/domain-incremental settings. |
| Examples | A classifier trained on animals and vehicles forgetting how to classify vehicles after learning new classes; an object detection model in autonomous driving failing to recognize stop signs after adapting to new road signs. |
Table 11. Overview of mitigating strategies.

| Mitigation Strategy | Description | Examples |
| --- | --- | --- |
| Regularization Methods | Introduce constraints during training to prevent significant updates to parameters crucial for earlier tasks. | EWC: penalizes parameter changes; Synaptic Intelligence: tracks parameter importance. |
| Replay-Based Methods | Retain and replay data from previous tasks during training on new tasks. | Experience Replay: stores a subset of prior task data; Generative Replay: generates synthetic data from past tasks. |
| Dynamic Architectures | Expand or adapt the network architecture to allocate new resources for each task. | Progressive Neural Networks: adds new parameters per task; dynamically expandable networks. |
| Representation Learning | Learn generalizable features that can be reused across tasks, reducing task-specific interference. | Self-supervised pretraining; disentangled representations. |
| Hybrid Approaches | Combine multiple strategies, such as regularization with replay or dynamic architectures. | Replay with EWC to balance plasticity and stability. |
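
The EWC entry can be made concrete with its quadratic penalty, which is commonly written as (λ/2) Σ_i F_i (θ_i − θ*_i)², where θ* are the parameters after the previous task and F_i is a per-parameter importance estimate (the diagonal of the Fisher information). The sketch below assumes those importance values are already available; the function name and all numbers are illustrative.

```python
from typing import Sequence


def ewc_penalty(params: Sequence[float],
                old_params: Sequence[float],
                importance: Sequence[float],
                lam: float = 1.0) -> float:
    """Quadratic penalty that grows when important parameters move.

    params     : current parameter values (theta)
    old_params : values stored after the previous task (theta*)
    importance : per-parameter importance, e.g. diagonal Fisher estimates
    lam        : strength of the stability constraint
    """
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, importance)
    )


# Hypothetical two-parameter model: the first parameter matters far more.
old_params = [1.0, -2.0]
importance = [10.0, 0.1]

penalty_unchanged = ewc_penalty(old_params, old_params, importance)
penalty_important_moved = ewc_penalty([1.5, -2.0], old_params, importance)
penalty_unimportant_moved = ewc_penalty([1.0, -1.5], old_params, importance)
```

The same displacement of 0.5 costs much more on the important parameter than on the unimportant one, so the optimizer is steered toward parameters that earlier tasks rely on less. The penalty would be added to the loss of the new task during training.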

| Evaluation Metric | Description |
| --- | --- |
| Forgetting Rate | Measures the drop in performance on previously learned tasks after learning new ones. |
| Accuracy | Assesses performance across all tasks (old and new). |
| Knowledge Transfer | Evaluates how well the model uses previous knowledge to improve learning on new tasks. |
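
These three metrics are often computed from an accuracy matrix in which entry `acc[i][j]` is the test accuracy on task j after training on tasks 0 through i. The sketch below follows one common set of definitions and is not taken from the source: the function names, the matrix values, and the `baseline` accuracies of an untrained model (needed for forward transfer) are all hypothetical.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def average_accuracy(acc: Matrix) -> float:
    """Mean accuracy over all tasks after training on the final task."""
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def average_forgetting(acc: Matrix) -> float:
    """Mean drop from each earlier task's best accuracy to its final accuracy.

    Requires at least two tasks; the last task cannot be forgotten yet.
    """
    num_tasks = len(acc)
    drops = []
    for j in range(num_tasks - 1):
        best_before_end = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_before_end - acc[-1][j])
    return sum(drops) / len(drops)


def forward_transfer(acc: Matrix, baseline: Sequence[float]) -> float:
    """Mean gain on each task before it is trained, relative to a baseline."""
    num_tasks = len(acc)
    gains = [acc[j - 1][j] - baseline[j] for j in range(1, num_tasks)]
    return sum(gains) / len(gains)


# Hypothetical results for three tasks (row = training stage, column = test task).
acc = [
    [0.90, 0.10, 0.05],
    [0.70, 0.85, 0.15],
    [0.60, 0.75, 0.80],
]
baseline = [0.0, 0.05, 0.10]  # assumed accuracies of an untrained model

final_acc = average_accuracy(acc)
forgetting = average_forgetting(acc)
transfer = forward_transfer(acc, baseline)
```

Reporting the full matrix alongside these summaries helps comparisons across papers, since slightly different definitions of forgetting and transfer are in use.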
Table 13. Comparison of major CL method categories.

| Method Category | Key Characteristics | Main Challenges | Memory Cost | Typical Applications |
| --- | --- | --- | --- | --- |
| Regularization-Based Methods | Constrain parameter updates to preserve previous knowledge; memory efficient and easy to integrate | Limited performance under severe domain shifts and long task sequences | Low | Task-incremental learning, resource-constrained systems, privacy-sensitive applications |
| Replay-Based Methods | Replay stored or generated samples to reinforce previous knowledge; strong retention performance | Replay buffer management, privacy concerns, and storage overhead | Moderate-High | Class-incremental learning, reinforcement learning, streaming adaptation |
| Architecture-Based Methods | Allocate task-specific modules or expandable subnetworks to reduce interference | Poor scalability due to parameter growth and increasing model complexity | High | Task-incremental learning with explicit task boundaries |
| Optimization-Based Methods | Modify gradient updates to balance stability and plasticity during training | High computational complexity and optimization overhead | Moderate | Gradient-constrained continual adaptation and stability-focused learning |
| Representation-Learning Methods | Learn transferable and domain-invariant feature representations across tasks | Representation drift under highly heterogeneous task distributions | Low-Moderate | Domain-incremental learning and self-supervised continual adaptation |
| Prompt-Based and PEFT Methods | Adapt pretrained foundation models using prompts, adapters, or low-rank updates | Prompt interference, adapter scalability, and long-term stability | Low | Foundation models, multimodal systems, large-scale deployment |
| Federated and Privacy-Aware CL | Enable CL across distributed clients without centralized data sharing | Client drift, communication overhead, and heterogeneous data distributions | Moderate | Healthcare, finance, edge AI, and mobile systems |
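
The low memory cost listed for PEFT methods follows from how low-rank updates are parameterized. A minimal sketch of a LoRA-style update is given below, assuming the usual form W' = W + (α/r)·B·A with a frozen d×k weight W, a trainable d×r matrix B initialized to zero, and a trainable r×k matrix A. The helper names and the tiny matrices are hypothetical and use plain Python lists rather than a tensor library.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x n) and b (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d: int, k: int, rank: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return rank * (d + k)


# Frozen 2 x 3 weight and a rank-1 update (all values illustrative).
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
a = [[0.5, 0.0, -0.5]]

at_init = lora_effective_weight(w, [[0.0], [0.0]], a, alpha=2.0)        # B = 0
after_update = lora_effective_weight(w, [[1.0], [0.0]], a, alpha=2.0)   # B changed
```

Because B starts at zero, the adapted model initially matches the pretrained one, and only `trainable_params(d, k, rank)` values need to be stored per task instead of `d * k`. In a continual setting, one such adapter per task or domain is one possible design, at the cost of the adapter-scalability issue noted in the table.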
Table 14. Summary of key applications of CL, focusing on their description, benefits, and examples.

| Application Area | Description | Key Benefits | Examples |
| --- | --- | --- | --- |
| Healthcare and Medical Imaging | Enables dynamic adaptation to evolving medical knowledge, diseases, and patient data over time. | Personalized diagnostics, improved adaptability, and long-term patient monitoring. | Radiology systems adapting to new imaging techniques or emerging diseases like novel cancer types. |
| Robotics and Autonomous Systems | Allows robots and autonomous systems to learn new tasks, adapt to dynamic environments, and retain prior knowledge. | Efficient task performance, knowledge transfer, and adaptability in real-world scenarios. | Household robots learning new cleaning techniques while retaining old capabilities like object recognition. |
| Natural Language Processing (NLP) | Helps models stay updated with evolving language patterns, domain-specific knowledge, and user preferences. | Better understanding of new language constructs, improved domain adaptation, and enhanced usability. | Chatbots adapting to new slang or technical jargon while maintaining general conversational abilities. |
| Recommender Systems | Adapts to changing user preferences and updates content or product catalogs dynamically. | Improved user engagement, personalized recommendations, and scalability for diverse user bases. | Streaming platforms suggesting trending shows based on current preferences without forgetting past ones. |
| Cybersecurity | Learns from new attack patterns and threat vectors while retaining the ability to recognize older threats. | Improved security, real-time threat detection, and reduced vulnerability to emerging cyberattacks. | Intrusion detection systems identifying novel malware while protecting against traditional viruses. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated